1 Introduction


In  this  final  report,  we  will  discuss  three  years’  research  supported  by  the  Air  Force  Office  of  Scientific 
Research  under  AFOSR  grant  88-0240  from  1988  to  1991.  The  discussion  will  be  from  the  instrumental 
point  of  view  of  current  and  future  research,  represented  by  the  successful  renewal  proposal  [1]  (of  which 
this  report  is  a  summary),  rather  than  from  a  historical  point  of  view. 

Much of our research has been based on the premise that mathematical methods and notation
associated  with  constrained  optimization  should  be  used  to  specify  a  neural  net,  which  can  then  be 
compiled  to  diverse  implementations.  But  where  do  we  get  such  a  compiler?  And  what  are  the  details 
of  this  mathematical  notation?  We  have  made  substantial  progress  on  these  research  questions: 

1. We have developed mathematical methods that can transform one algebraic NN description into
another, more implementable one. These developments were attained by serious work in the
applied  mathematics  of  neural  nets.  They  can  form  the  basis  of  a  neural  compiler  because  they 
address  most  of  the  major  NN  compilation  and  implementation  issues.  But  they  do  not  yet  suffice. 

2. We have been accumulating the research in a neural simulator. It can be expanded into a
semi-automatic compiler: a neural net design and implementation environment based on mathematical
methods. 

3. We have developed a mathematical notation (not yet a formal language) for describing complex
problem domains in terms of constrained optimization problems. The optimization problems
can  be  solved  by  neural  nets  as  described  in  point  1.  Our  notation  involves  objective  functions, 
L-system  grammars,  and  maps  between  such  objects. 

The design and implementation method outlined in these assertions is illustrated in Figure 1. We
will  outline  our  achievements  in  each  of  these  areas  of  research,  and  their  relationship  to  a  larger  plan  of 
future  work  on  the  use  of  symbolic  algebra  in  implementing  neural  networks,  in  the  next  section.  The 
final  section  and  the  appendices  will  provide  an  entrance  to  more  detailed  expositions  of  the  work. 

2  Summary  of  Research  Progress 

2.1  Algebraic  Transformations 

Assertion 1. We have developed mathematical methods that can transform one algebraic NN description
into  another,  more  implementable  one.  These  developments  were  attained  by  serious  work  in  the  applied 
mathematics  of  neural  nets.  They  could  form  the  basis  of  a  neural  compiler  because  they  address  most 
of  the  major  NN  compilation  and  implementation  issues.  But  they  do  not  yet  suffice. 

In [2] (included as an Appendix) we first introduced a catalog of fixed-point-preserving transformations
that could be applied to neural net objective functions, so as to reduce their cost or increase their
implementability in some technology. The catalog is shown here as equation 1. This catalog is intended
ultimately to provide the same kind of cookbook approach to neural net design that a table of integrals

[Figure 1 diagram. Recoverable labels: Problem-modelling, Hand-design, Grammar Γ, Learning.]

Figure  1:  A  neural  network  design  methodology.  Solid  arrows  constitute  the  recommended  procedure. 
The arrow from Γ to E may be realized by approximations from statistical physics, such as Mean Field
Theory.  The  circular  arrow  represents  fixed-point  preserving  transformations  of  objective  functions. 

provides to integration: the intellectual work is not magically eliminated, but it is considerably reduced
through the distilled wisdom of generations of previous researchers. The simplest example discussed in
[2] (using Rule 1.1) explains the then-standard trick for reducing a winner-take-all network from $O(N^2)$
connections to $O(N)$; that method has been used in a variety of analog neural network chips including
a stereopsis chip by Delbruck and Mead. This and several other cataloged algebraic transformation
patterns could be shown to be Legendre transformations; others weren’t of this class, but still preserved
fixed points. All could be applied mechanically once the decision to use them was taken.
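The winner-take-all reduction can be sketched numerically. The sketch below assumes Rule 1.1 has the form ½X² → Xσ − ½σ² (as quoted in section 2.2) and assumes, purely for illustration, a winner-take-all penalty of the form ½(Σ_i v_i − 1)²; the function names are hypothetical and do not come from [2].

```python
def wta_dense(v):
    """0.5 * (sum_i v_i - 1)^2 expanded into monomials.

    The quadratic part contains one product v_i * v_j per ordered neuron
    pair, i.e. on the order of N^2 connections.
    """
    n = len(v)
    pair_sum = sum(v[i] * v[j] for i in range(n) for j in range(n))
    return 0.5 * pair_sum - sum(v) + 0.5


def wta_interneuron(v, sigma):
    """Assumed Rule 1.1 applied with X = sum_i v_i - 1: X*sigma - 0.5*sigma^2.

    Every neuron now connects only to the single interneuron sigma,
    giving on the order of N connections.
    """
    x_expr = sum(v) - 1.0
    return x_expr * sigma - 0.5 * sigma ** 2


def connection_counts(n):
    """(dense pairwise products, products with the interneuron)."""
    return n * n, n


v = [0.2, 0.5, 0.9]
sigma_star = sum(v) - 1.0  # stationary value of sigma: d/dsigma = X - sigma = 0
print(wta_dense(v), wta_interneuron(v, sigma_star))
print(connection_counts(100))
```

At the stationary value of the interneuron the two objectives agree, which is the sense in which the transformation preserves fixed points while the connection count drops from quadratic to linear in N.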

The  limitations  of  the  first  catalog  of  transformations  were  clear:  they  applied  to  optimization-based 
(“Hopfield”)  nets  and  couldn’t  smooth  out  rough  objective  functions.  So  learning,  convergence  speed 
and  global  optimization  were  all  questionable.  Nevertheless  they  gave  the  neural  net  designer  some 
remarkable  capabilities. 

Technology-Specific Tricks. For example, special neural transfer functions, such as log and exponential,
are available in CMOS [3]. Transformations could be used to shift the corresponding nonlinearity
from a single variable to an entire expression in an objective function, greatly expanding the range of
implementable objectives. Also the wiring complexity of circuits other than the winner-take-all net
could be greatly reduced by introducing (semi-automatically) new linear interneurons that compute
reusable expressions. Graph-matching networks were a subtle example, potentially central to high-level
vision.

Table of Algebraic Transformations. These transformations preserve fixed points of objectives, or summands
thereof. X and Y are any algebraic expressions, containing any number of variables.


Parallel Computer Implementation. Furthermore, linear interneurons could be introduced
between two previously connected neurons x and y:

$$E_{xy} = xy \;\longrightarrow\; \hat{E}_{xy} = x(\sigma - \tau) + y(\sigma - \omega) - \tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\tau^2 + \tfrac{1}{2}\omega^2 . \qquad (2)$$

This innocuous-looking transformation has great consequences, for the linear interneuron σ can be
interpreted as the wire connecting up two different modules in a partitioned hardware implementation
(e.g. two chips or two CPUs in a parallel computer). This interneuron can relax according to an entirely
different speed schedule from the other neurons, allowing a communication channel to be modelled. In
fact the physical wire carrying σ may even be multiplexed with other connections between the same
two modules. We have performed preliminary experiments [4] in which this transformation was used
to split up a large neural net across 2 to 8 processors of an Encore Multimax running Linda, in which
the observed substantial speedup resulted solely from our ability to vary the relative speed of the σ
interneurons. In other words, this transformation allowed a neural net parallelization on a coarse-grained
MIMD machine which would otherwise not have worked. (We will propose to continue this
investigation.)
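A quick numerical check makes the fixed-point claim concrete. The sketch reads equation (2) as Ê = x(σ − τ) + y(σ − ω) − ½σ² + ½τ² + ½ω², with σ, τ and ω as the three linear interneurons; that reading of the printed symbols, and all function names, are assumptions of this example.

```python
def e_hat(x, y, sigma, tau, omega):
    """Transformed objective: x and y interact only through interneurons."""
    return (x * (sigma - tau) + y * (sigma - omega)
            - 0.5 * sigma ** 2 + 0.5 * tau ** 2 + 0.5 * omega ** 2)


def interneuron_partials(x, y, sigma, tau, omega):
    """Partial derivatives of e_hat with respect to sigma, tau, omega."""
    return (x + y - sigma, tau - x, omega - y)


def stationary_interneurons(x, y):
    """Values at which all three interneuron partials vanish."""
    return x + y, x, y


x, y = 0.3, -1.7
sigma, tau, omega = stationary_interneurons(x, y)
print(interneuron_partials(x, y, sigma, tau, omega))  # all zero
print(e_hat(x, y, sigma, tau, omega), x * y)          # equal up to rounding
# The force on x at the interneuron fixed point is sigma - tau = y,
# the same as d(xy)/dx in the original objective.
print(sigma - tau, y)
```

Because no monomial in `e_hat` contains both x and y, the two neurons can live in different hardware modules, with σ acting as the only shared quantity.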

Controllable  Dynamics.  Also  in  the  first  catalog  was  a  highly  nontrivial  transformation  that 
replaces  the  objective  function  itself  with  a  Lagrangian  that  governs  the  entire  state  space  trajectory, 
including  the  dynamics,  of  a  neural  net,  while  still  forcing  convergence  to  locally  optimal  states  according 
to  the  original  objective.  The  central  problem  was  to  preserve  the  Lagrangian  formalism  as  much  as 
possible  while  allowing  convergence  to  a  fixed  point;  we  did  this  by  replacing  the  conventional  functional 
derivative  with  a  variant  called  the  “greedy  functional  derivative”. 

Virtual  Neurons  and  Attention.  The  aforementioned  algebraic  transformation  from  objective  to 
Lagrangian  can  be  used  to  model  virtual  neurons  and  connections  and  to  optimize  the  kind  of  hardware 
multiplexing  done  in  the  Bell  Labs  ANNA  chip,  which  is  inevitable  for  any  implementation  in  which  a 
large  network  is  mapped  to  a  smaller  but  flexible  circuit  by  using  physical  neurons  to  simulate  different 
virtual  neurons  at  different  times.  More  generally  the  Lagrangian  approach  provides  a  “computational 
attention  mechanism”  for  optimally  choosing  the  most  important  part  of  an  optimization  problem  to 
work  on  next.  In  [5]  we  applied  this  method  to  derive  several  kinds  of  attention  windows  for  a  two- 
dimensional  surface  reconstruction  network,  including  sliding,  jumping  and  rolling  windows  of  attention. 
A  combination  of  rolling  and  jumping  is  expected  to  be  the  most  effective  here;  the  investigation 
continues. 

Multiscale Acceleration. Since the first catalog of transformations was published we have developed
others. The question of convergence speed was addressed first, with the publication [6] of a
multiscale acceleration technique for optimization neural nets. It is a generalization of the standard
highly effective multigrid algorithm for solving partial differential equations, systems of ODEs, or just
systems of linear equations. It requires very few assumptions on the neural network to be applied, and it
may be viewed as a transformation which turns one fine-scale objective function into a set of compatible
neural net objective functions at different scales, including the original one at the finest scale.

Figure 2: A rolling window of attention.

Communication Cost. Communication cost has been addressed by the development of a special-purpose
message-routing network for Content Addressable Memories and other models of recognition in
which the closest memory according to some metric is to be retrieved. Once again it may be viewed as an
algebraic transformation, though on a very special form of objective function. This algorithm was put
in its current form by Professor Bhatt at Yale, an expert on communication in parallel computers, and
we are currently implementing it on a Connection Machine. The empirical timing results are consistent
with the theoretical performance: simultaneously matching N inputs to N memories takes $O(N \log N)$
wires and $O(\log^2 N)$ time steps of a communication-bound feed-forward algorithm. We propose to
continue  this  work.  More  generally,  message  routing  and  other  standard  communication  problems  arise 
for  any  sufficiently  large  or  complex  neural  network  implementation,  but  in  an  especially  advantageous 
form  since  problem-specific  knowledge  may  be  used  to  improve  low-level  communication  performance. 
There  is  no  guarantee  that  biological  networks  use  a  general-purpose  message  router  even  if  for  reasons 
of  algorithmic  efficiency  they  must  perform  analogous  functions. 

Learning.  Learning  and  especially  learning  in  feed-forward  nets  has  received  remarkably  short 
shrift  in  our  research  so  far,  since  it  was  not  on  the  main  line  of  technical  development.  However  we 
have  known  for  two  years  that  an  algorithm  extremely  close  to  Pineda’s  recurrent  backpropagation  can 
be  derived  by  transforming  the  standard  squared-error  objective  function  for  learning.  The  difference 
is  only  in  the  dynamics;  the  set  of  fixed  points  is  the  same  as  Pineda’s.  Also  feed-forward  networks 
have  a  trivial  objective  function  themselves  [7],  and  hence  can  be  handled  this  way.  Since  nobody  needs 
another derivation of backpropagation, this method is of interest only insofar as it can be automated to

Figure 3: Deterministic annealing. (a) Objective function at σ = .0364 (very little smoothing). (b)
Objective function at σ = .300, smoothed; the minimum of this function provides a good starting point
for minimizing (a).

derive  new  implementations  of  learning  neural  nets  automatically,  under  all  the  constraints  of  limited 
hardware,  communication  and  flexibility  that  have  already  been  discussed.  But  in  addition  to  these 
transformational  considerations,  we  have  a  far  more  important  suggestion  for  how  to  make  substantial 
progress  in  learning.  It  involves  the  grammatical  derivation  of  neural  nets  which  will  be  introduced  in 
section  2.3. 

Global Optimization. Effective approximations to global optimization via neural nets look considerably
more tractable since the recent flurry of research in deterministic annealing methods [8, 9, 10],
none  of  which  we  were  responsible  for.  These  methods  smooth  a  bumpy  objective  function  and  are 
often  clearly  automatable  as  objective  function  transformations,  in  which  some  of  the  constraints  in  a 
constrained  optimization  problem  can  be  imposed  exactly  rather  than  by  penalty  terms.  An  interesting 
variety  of  new  neural  net  transfer  functions  and  dynamical  systems  result,  and  we  continually  find  more 
by  working  on  neural  networks  for  computer  vision  [11].  Another  approach  to  removing  local  minima 
will  be  described  in  the  next  section. 

One example of the smoothing effect of deterministic annealing is our registration network for line-segment
images [2, 12], in which the objective is a bumpy function of a global displacement Δx. The
objective and a smoothed version are shown in figure 3.
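The coarse-to-fine use of a smoothed objective can be illustrated with a one-dimensional toy problem. This is a sketch under stated assumptions rather than the registration network of [2, 12]: the objective form (a sum of Gaussians over all model/image point pairs), the smoothing width `sigma`, the point sets, and the windowed grid search standing in for local descent are all hypothetical choices made for the example.

```python
import math


def registration_objective(shift, model, image, sigma):
    """Assumed form: E = -sum_ij exp(-(m_i + shift - p_j)^2 / (2 sigma^2))."""
    return -sum(
        math.exp(-((m + shift - p) ** 2) / (2.0 * sigma ** 2))
        for m in model
        for p in image
    )


def local_search(objective, start, half_width, steps=30):
    """Best grid point within +/- half_width of start (stand-in for descent)."""
    best, best_val = start, objective(start)
    for k in range(-steps, steps + 1):
        cand = start + half_width * k / steps
        val = objective(cand)
        if val < best_val:
            best, best_val = cand, val
    return best


def anneal(model, image, sigmas, start=0.0):
    """Minimize at each smoothing width in turn, reusing the previous answer."""
    shift = start
    for sigma in sigmas:
        shift = local_search(
            lambda s: registration_objective(s, model, image, sigma),
            shift,
            3.0 * sigma,  # search window shrinks with the smoothing width
        )
    return shift


MODEL = [0.0, 0.1, 0.25, 0.45, 0.7]
TRUE_SHIFT = 0.37
IMAGE = [m + TRUE_SHIFT for m in MODEL]

annealed = anneal(MODEL, IMAGE, sigmas=[0.3, 0.1, 0.0364])
direct = anneal(MODEL, IMAGE, sigmas=[0.0364])
print(annealed, direct)
```

With very little smoothing the local search started at zero cannot leave its small window, whereas the annealed schedule uses the smoothed minimum as a starting point for the bumpy objective.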

Summary  of  Algebraic  Transformations.  The  algebraic  transformations  above  are  far  from  a 
complete set that would suffice for a mathematical compiler, since some are still under development and
we have not yet tried very many target architectures or input neural nets, and since we have not yet
fully  automated  the  procedure.  Also  there  is  an  opportunity,  demonstrated  by  the  log  and  exponential 
transformations  available  for  analog  CMOS,  to  build  a  library  of  technology-specific  tricks  to  minimize 
implementation  costs  in  important  target  technologies.  But  the  transformations  we  studied  do  show 
that  most  of  the  main  engineering  issues  that  generate  hardware  diversity  and  that  a  compiler  would 
have  to  resolve  in  translating  objective  functions  to  neural  net  implementations  fall  within  the  purview 
of  algebraic  transformations. 

2.2  Software  Engineering 

Assertion  2.  We  have  been  accumulating  the  research  in  a  neural  simulator.  It  could  be  expanded  into  a 
semi-automatic  compiler:  a  neural  net  design  and  implementation  environment  based  on  mathematical 
methods. 

Algebraic Data Structure. Our present simulator is based on the algebra of objective functions.
The data structure holding the neural net is itself regarded (and implemented) as a large objective
function in which synapses are monomial summands. Constraints can be added to the net via quadratic
penalty functions (e.g. expanded out to make many new synapses on existing neurons) or Lagrange
multiplier neurons [13]. Optimization methods such as gradient descent, conjugate gradient, deterministic
annealing and multiscale methods are packaged as “optimizers” which could be applied to a
variety  of  optimization  problems,  both  in  learning  and  running  a  network.  In  this  way  optimization  is 
modularized. 
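The idea of storing a net as an objective function whose synapses are monomials can be sketched as follows. The simulator itself is written in C++ and its interfaces are not given in this report; the Python class, method names and the plain gradient-descent "optimizer" below are hypothetical illustrations only.

```python
from collections import defaultdict


class Objective:
    """Polynomial objective: sorted tuple of neuron indices -> coefficient."""

    def __init__(self):
        self.terms = defaultdict(float)

    def add_monomial(self, coeff, *neurons):
        self.terms[tuple(sorted(neurons))] += coeff

    def add_quadratic_penalty(self, strength, weights, target):
        """Add 0.5*strength*(sum_i w_i v_i - target)^2, expanded into monomials."""
        items = list(weights.items())
        for i, wi in items:
            for j, wj in items:
                self.add_monomial(0.5 * strength * wi * wj, i, j)
            self.add_monomial(-strength * target * wi, i)
        self.add_monomial(0.5 * strength * target ** 2)

    def value(self, v):
        total = 0.0
        for neurons, coeff in self.terms.items():
            prod = coeff
            for n in neurons:
                prod *= v[n]
            total += prod
        return total

    def gradient(self, v):
        grad = {n: 0.0 for n in v}
        for neurons, coeff in self.terms.items():
            for k, n in enumerate(neurons):
                prod = coeff
                for m in neurons[:k] + neurons[k + 1:]:
                    prod *= v[m]
                grad[n] += prod
        return grad


def gradient_descent(objective, v, rate=0.05, steps=2000):
    """A generic optimizer that only needs value/gradient access."""
    v = dict(v)
    for _ in range(steps):
        g = objective.gradient(v)
        for n in v:
            v[n] -= rate * g[n]
    return v


net = Objective()
net.add_monomial(-1.0, 0, 1)                            # one synapse: -v0*v1
net.add_quadratic_penalty(10.0, {0: 1.0, 1: 1.0}, 1.0)  # constraint v0 + v1 = 1
result = gradient_descent(net, {0: 0.9, 1: 0.1})
print(result)
```

Expanding the penalty adds new monomials on the existing neurons, and the optimizer is independent of how the objective was assembled, which is the modularity the paragraph above describes.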

Class  Libraries.  Likewise  “container  classes”  and  a  neuron  indexing  scheme  modularize  our  data 
structure  code,  and  the  use  of  Interviews  (a  C++  interface  to  X-windows)  modularizes  a  graphical  user 
interface.  We  are  experimenting  with  parallel  implementations  using  Linda,  a  portable  programming 
language  extension.  This  is  currently  done  with  “fragment”  objects  which  implement  a  kind  of  domain 
decomposition  for  neural  nets.  We  have  used  the  simulator  mostly  for  problems  in  high-level  vision, 
a  rich  domain  of  considerable  interest  in  its  own  right.  The  simulator  is  in  C++  and  benefits  from 
object-oriented  design;  in  particular  the  “simulator”  is  not  so  much  a  single  program  as  it  is  a  reusable 
class  library  for  neural  simulation.  It  is  possible  that  in  the  future  we  will  also  do  object-oriented 
programming  in  CLOS  (the  Common  Lisp  Object  System)  or  even  Mathematica  for  symbolic  algebra 
manipulation. 

Towards  a  Compiler.  In  order  to  expand  this  software  into  a  semi-automatic  compiler  it  would  be 
necessary  to  make  it  a  bit  more  like  a  computer  algebra  system:  more  interactive  and  with  more  syntactic 
forms  recognized.  Indeed  the  compiler  probably  should  be  prototyped  within  an  existing  computer 
algebra system such as Mathematica. For example here is transformation Rule 1.1, $\frac{1}{2}X^2 \rightarrow X\sigma - \frac{1}{2}\sigma^2$,
superficially translated into Mathematica:

(* Initialize the list of reversed neurons *)

ReversedNeurons = {}

(* We have to remove protection from the Power function so that we
   can attach a new transformation. The transformation will only
   be used when the expression x is more complex than a single number
   or variable. *)

Unprotect[Power]

x_^2 := QuadraticTransform[x] /; !AtomQ[x]

Protect[Power]

(* A new symbol is created with Unique and added to the list of reversed
   neurons. The final line is the transformed expression. *)

QuadraticTransform[x_] := Block[{usim = Unique["sigma"]},
    AppendTo[ReversedNeurons, usim];
    2 x usim - usim^2]

and here is a simple example of its use:

In[1]:= <<rule1.1.m

In[2]:= (x+y)^2    (* x+y can be transformed *)

               2
Out[2]= -sigma1  + 2 sigma1 (x + y)

The  resulting  expression  could  then  be  exported  to  a  neural  simulator  or  other  implementation  code.  For 
the  full  compiler  system,  graphical  tools  analogous  to  standard  circuit  design  tools  would  be  included. 
For  example  each  valid  transformation  rule  could  have  its  own  pop-up  window  by  which  a  human  user 
could  direct  or  modulate  its  application.  Such  interfaces  have  become  far  easier  to  construct  in  the  last 
few  years  due  to  tools  like  Interviews  and  its  interactive  builder  of  user  interfaces,  IBuild.  Gradually 
one  could  develop  even  greater  automation,  in  which  the  choice  of  transformation  rule  and  its  locus  of 
action  is  also  automated. 

Software Summary. The purpose of such software engineering is not to develop a product but
to demonstrate the feasibility of a new kind of software tool for neural networks, based on serious
mathematical methods which can be encapsulated as algebraic transformations, and to support research
into both neural net design and implementation.

2.3 Mathematical Notation for Complex Problem Domains

Assertion 3. We have developed a mathematical notation (not yet a formal language) for describing
complex  problem  domains  in  terms  of  constrained  optimization  problems.  The  optimization  problems 
can  be  solved  by  neural  nets  as  described  in  Assertion  1.  Our  notation  involves  objective  functions, 
L-system  grammars,  and  maps  between  such  objects. 

How  to  Compose  Diverse  Objective  Functions?  So  far  we  have  described  the  research  situation 
and  opportunities  as  if  it  were  always  not  only  possible  but  even  easy  to  formulate  neural  net  applications 
as  constrained  optimization  problems.  Unfortunately  our  experience  is  that  it  is  generally  possible  but 
not  trivial  to  get  relatively  compact  problem  specifications  this  way.  We  now  perceive  a  need  for  a  coarser 
level  of  structure  -  a  symbolic  and  expressive  programming  language  -  which  can  “compose”  or  glue 
together  many  individual  objective  functions  and  algebraic  constraints  describing  different  aspects  of  a 
problem  into  a  single  modular  problem  description.  Yet  we  do  not  want  to  give  up  the  unique  advantages 
(learning,  circuit  implementation)  that  neural  nets  gained  by  leaving  traditional  programming  languages 
behind.  Recognizing  and  resolving  this  conflict  is  a  major  result  of  our  research  of  the  past  few  years. 

Grammars  Can  Compose  Objectives.  Our  solution  comes  from  the  world  of  L-systems,  parallel 
grammars originally introduced by Lindenmayer [14, 15] to describe plant growth. We introduce grammars
(see Box 1) whose production rules are each governed by a Boltzmann probability distribution
(i.e. by an algebraic objective function) which specifies when and how the rule may fire. Ordinary
L-systems are attractive for physically-based computation because they can directly model highly parallel
dynamical systems and, with just one rule, fractal growth processes. L-systems augmented with
objective functions inherit, for each grammar rule, all of the power of constrained optimization we have
advertised and exploited in vision. A grammar with many rules can correctly compose many diverse
objective functions and create one organized problem specification. We have detailed a number of applications
of such objective function grammars to vision in [12], including the transformational derivation
of  neural  nets  as  in  equation  1.  Generally,  one  could  say  that  these  applications  demonstrate  the  use 
of  grammars  in  stating  and  solving  visual  pattern  recognition  problems.  This  is  especially  true  where 
other  neural  methods  fail:  in  making  use  of  regularities  which  are  abstract  and  remote  from  the  pixel 
level.  Another  application  of  connectionist  grammars,  to  the  modelling  of  biological  development,  is 
described  in  [16]. 

Grammars Support Systems Integration. One of the main sources of hardware diversity in
artificial  neural  network  design  is  the  way  in  which  a  neural  net  is  to  be  integrated  with  the  rest  of 
a  computer  system.  Algebraic  transformations  were  proposed  for  connecting  up  different  hardware 
modules  in  section  2.1.  At  the  software  level,  almost  all  software  architectures  will  be  far  more  easily 
integrated  with  a  grammar  than  with  a  neural  net,  since  computer  languages  of  all  sizes  are  fundamental 
in  computer  science  and  practice.  In  connectionist  grammars  we  have  a  natural  way  to  mix  grammars 
with  neural  nets,  which  we  think  will  support  the  integration  of  neural  nets  into  generic  computer 
systems. 

Learning in Grammars. Connectionist grammars raise a fundamental new opportunity in learning.
The reason is that machine learning, including neural net learning, is fundamentally dominated by




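Box 1. Connectionist Grammars.

Consider an object with a hierarchical decomposition into parts, with internal degrees of freedom
describing the relative positions of the parts. For random dot features, the resulting images will
generally be clusters of dots with unpredictable jitter of both the dot and the cluster positions.
A model of such an object is given by this grammar:

model locations:
$$\Gamma^0: \ \mathrm{root} \rightarrow \text{instance of model } \alpha \text{ at } \mathbf{x}, \qquad E_0(\mathbf{x}) = \frac{1}{2\sigma_r^2} |\mathbf{x}|^2$$

jittered cluster locations:
$$\Gamma^1: \ \mathrm{instance}(\alpha, \mathbf{x}) \rightarrow \{\mathrm{cluster}(\alpha, c, \mathbf{x}_c)\}, \qquad E_1(\{\mathbf{x}_c\}) = \frac{1}{2\sigma_{cd}^2} \sum_c |\mathbf{x}_c - \mathbf{x} - \mathbf{u}_c^{\alpha}|^2, \ \text{where } \langle \mathbf{u}_c^{\alpha} \rangle_c = 0$$

jittered dot locations:
$$\Gamma^2: \ \mathrm{cluster}(\alpha, c, \mathbf{x}_c) \rightarrow \{\mathrm{dot}(c, m, \mathbf{x}_{cm})\}, \qquad E_2(\{\mathbf{x}_{cm}\}) = \frac{1}{2\sigma_{jt}^2} \sum_m |\mathbf{x}_{cm} - \mathbf{x}_c - \mathbf{u}_{cm}^{\alpha}|^2, \ \text{where } \langle \mathbf{u}_{cm}^{\alpha} \rangle_m = 0$$

scramble all dots:
$$\Gamma^3: \ \{\mathrm{dot}(c, m, \mathbf{x}_{cm})\} \rightarrow \{\mathrm{imagedot}(\mathbf{x}_i = \textstyle\sum_{cm} P_{cm,i}\, \mathbf{x}_{cm})\}, \qquad E_3(\{\mathbf{x}_i\}) = -\log \prod_i \delta\big(\mathbf{x}_i - \textstyle\sum_{cm} P_{cm,i}\, \mathbf{x}_{cm}\big),$$
$$\text{where } \sum_i P_{cm,i} = 1 \ \wedge \ \sum_{cm} P_{cm,i} = 1$$

which is illustrated below:

[Illustration: an instance is expanded into clusters, then into labelled dots (1,1), (1,2), (1,3), (1,4), (2,*), (3,*), and finally into unordered dots.]

The corresponding probability distribution is:

$$\mathrm{Pr}^3(\alpha, \mathbf{x}, \{\mathbf{x}_c\}, \{\mathbf{x}_i\}) \propto \sum_{\{P \,|\, P \text{ is a permutation}\}} \exp\Big[ -\sum_{cm,i} P_{cm,i} \Big( \frac{1}{2N\sigma_r^2} |\mathbf{x}|^2 + \frac{C}{2N\sigma_{cd}^2} |\mathbf{x}_c - \mathbf{x} - \mathbf{u}_c^{\alpha}|^2 + \frac{1}{2\sigma_{jt}^2} |\mathbf{x}_i - \mathbf{x}_c - \mathbf{u}_{cm}^{\alpha}|^2 \Big) \Big] \qquad (4)$$

where C is the number of clusters and N/C is the number of dots in each cluster. From this
expression, one can derive a neural net to recognize which model is present in the final image.

issues of representation: what are the input and output representations, and what kinds of internal representations
can the learner actually create? Yet grammars offer representational flexibility approaching
that of conventional symbolic programming languages, in a quantitative and probabilistic context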
that  makes  scoring  functions  continuous  and  makes  gradient-descent  training  possible.  In  other  words, 
the  best  features  of  AI  languages  (flexible  representation)  and  neural  net  parameterizations  (learning 
and  generalization)  are  combined  to  make  a  mathematical  object  (the  connectionist  grammar)  for  which 
both  the  human  designer  and  the  learning  dynamics  can  deeply  influence  and  vary  the  representations 
used. 
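To make the generative reading of the Box 1 grammar concrete, the following sketch draws one image from it. It assumes that each quadratic objective acts as a Boltzmann distribution at unit temperature, so each rule reduces to adding Gaussian jitter; the widths `sigma_r`, `sigma_cd`, `sigma_jt`, the template offsets and the function name are hypothetical values chosen for the example.

```python
import random


def sample_image(u_cluster, u_dot, sigma_r, sigma_cd, sigma_jt, rng):
    """Apply the four rules in order and return (instance, dots, image)."""
    # Rule 0: place the model instance near the origin.
    x = (rng.gauss(0.0, sigma_r), rng.gauss(0.0, sigma_r))
    dots = []
    for c, uc in enumerate(u_cluster):
        # Rule 1: jittered cluster location around x + u_c.
        xc = (x[0] + uc[0] + rng.gauss(0.0, sigma_cd),
              x[1] + uc[1] + rng.gauss(0.0, sigma_cd))
        for um in u_dot[c]:
            # Rule 2: jittered dot location around x_c + u_cm.
            dots.append((xc[0] + um[0] + rng.gauss(0.0, sigma_jt),
                         xc[1] + um[1] + rng.gauss(0.0, sigma_jt)))
    # Rule 3: scramble all dots with a random permutation, discarding labels.
    image = list(dots)
    rng.shuffle(image)
    return x, dots, image


# Template offsets average to zero, as the grammar requires.
U_CLUSTER = [(-1.0, 0.0), (1.0, 0.0)]
U_DOT = [[(-0.2, 0.0), (0.2, 0.0)], [(0.0, -0.2), (0.0, 0.2)]]

rng = random.Random(0)
instance, dots, image = sample_image(U_CLUSTER, U_DOT, 0.5, 0.1, 0.02, rng)
print(len(image), "unordered image dots")
```

The image holds the same positions as the labelled dots but without the cluster and dot indices; recovering those indices is the correspondence problem that the derived neural net has to solve.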

Our  proposed  methods  for  learning  a  grammar  have  been  spelled  out  in  the  last  section  of  [12], 
from  which  we  now  quote.  “Three  possible  methods  for  learning  a  grammar  are  suggested  here.  They 
all  assume  that  learning  the  grammar  can  be  expressed  as  tuning  its  parameters,  as  is  the  case  for 
unstructured  neural  networks.  First,  the  kind  of  grammar  we  have  been  studying  could  be  augmented 
with  an  initial  set  of  ‘metagrammar’  rules,  which  randomly  choose  the  parameters  of  the  permanent 
models  and  then  generate  many  images  by  the  usual  grammar.  The  task  of  inferring  the  permanent 
models’  parameters  is  just  another  Bayesian  inference  problem,  stretched  out  over  many  images.  Second, 
one  could  minimize  the  Kullback  information  [17]  between  the  probability  distributions  of  an  unknown 
grammar,  images  from  which  the  perceiver  sees,  and  a  parameterized  grammar.  This  algorithm  would  be 
similar  to  the  ‘Boltzmann  machine’  for  neural  network  learning  [18].  Finally  one  could  look  for  clusters 
in  model  space  by  defining  a  distance  ‘metric’  D  between  images  and  mathematically  projecting  it  back 
through the grammar.” (There follows a general-purpose candidate D.) We propose to continue in these
directions,  which  may  not  all  be  very  different  from  each  other. 
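The second suggestion, minimizing the Kullback information between an observed distribution and a parameterized one, can be sketched on a toy discrete problem. This is not the grammar-learning algorithm of [12]: the three-outcome distribution, the Boltzmann parameterization p_k ∝ exp(−θ_k) and all names are assumptions made for illustration. Under that parameterization the gradient of the Kullback information with respect to θ_k is q_k − p_k, a difference of observed and model statistics of the kind used in Boltzmann machine learning.

```python
import math


def boltzmann(theta):
    """Model distribution p_k proportional to exp(-theta_k)."""
    weights = [math.exp(-t) for t in theta]
    z = sum(weights)
    return [w / z for w in weights]


def kullback(q, p):
    """Kullback information sum_k q_k log(q_k / p_k)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def fit(q, steps=2000, rate=0.5):
    """Gradient descent on theta using d(KL)/d(theta_k) = q_k - p_k."""
    theta = [0.0] * len(q)
    for _ in range(steps):
        p = boltzmann(theta)
        theta = [t - rate * (qi - pi) for t, qi, pi in zip(theta, q, p)]
    return theta


observed = [0.5, 0.3, 0.2]
theta = fit(observed)
print(boltzmann(theta), kullback(observed, boltzmann(theta)))
```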

Variable-Binding. A standard problem in applying neural nets to symbolic reasoning, as might be
required in high-level vision or in language processing, is their limited ability to bind variables within a
local  context.  Such  limitations  on  the  expressiveness  of  neural  nets  are  burdensome  for  programming 
or  designing  neural  nets.  We  showed  one  optimization-based  way  to  overcome  these  limitations  in 
[19],  where  we  translate  a  simple  frame-based  language  and  its  valid  deductions  into  a  neural  network 
architecture  closely  related  to  our  graph-matching  architecture  for  high-level  vision  nets  [20]. 

Maps  Between  Grammars.  A  connectionist  grammar  is  built  out  of  objective  functions  and  would 
therefore  lie  above  optimization  in  a  conventional  layered-system  diagram  of  the  modelling  language  we 
are  discussing.  Likewise  we  now  suggest  a  further  level  of  abstraction  beyond  grammars:  maps  between 
grammars,  for  example  the  map  by  which  one  grammar  is  optimized  to  approximate  another  one  which 
differs  in  number  of  rules  or  other  properties.  (Figure  4  is  the  corresponding  layered-system  diagram.) 
Other  important  maps  between  grammars  could  be  defined,  but  this  level  of  abstraction  is  beyond  the 
grasp  of  our  research  so  far. 

Programming  Language  Research.  In  this  section,  in  contrast  to  section  2.1,  we  have  mainly 
discussed  programming  language  issues  rather  than  prospects  for  new  mathematical  methods.  In  a 
research  program  to  develop  mathematical  methods,  why  is  this  necessary?  The  answer  is  that  our 
proposed language for stating and solving application problems with neural nets must be both implementable
and adequately expressive, as must any other computer language. Both criteria substantially

[Figure 4 diagram. Layers, from top to bottom: Maps Between Grammars; Connectionist Grammars; Objective Functions; Real Variables, Graphs, Algebra ...]

Figure  4:  Components  of  a  layered  modelling  language. 

constrain the language, and in this way the expressiveness considerations typical of programming language
research come to influence what implementation questions arise - i.e. what problems we will be
called  upon  to  solve  with  new  mathematical  methods. 

Of course there also exist less well developed and nonquantitative branches of mathematics - such as
logic and category theory - which pertain directly to the expressiveness constraints on a programming
language. It is likely that connectionist grammars will develop in this direction to some extent.

3 Abstracts of works supported by AFOSR-88-0240


Eric  Mjolsness  and  Willard  L.  Miranker,  “A  Lagrangian  Approach  to  Fixed  Points”,  Neural  Information 
Processing  Systems  3: 

We  present  a  new  way  to  derive  dissipative,  optimizing  dynamics  from  the  Lagrangian  formulation 
of  mechanics.  It  can  be  used  to  obtain  both  standard  and  novel  neural  net  dynamics  for  optimization 
problems.  To  demonstrate  this  we  derive  standard  descent  dynamics  as  well  as  nonstandard  variants 
that  introduce  a  computational  attention  mechanism. 

Eric  Mjolsness,  Anand  Rangarajan,  and  Charles  Garrett,  “A  Neural  Net  for  Reconstruction  of 
Multiple  Curves  with  a  Visual  Grammar”,  1991  International  Joint  Conference  on  Neural  Networks, 
Seattle: 

We derive a neural net for reconstructing a set of curves from ungrouped dot locations. The network
performs Bayesian inference on a visual grammar, which serves as a probabilistic model of the image
formation process, by means of a quadratic matching objective function.

Eric Mjolsness, “Bayesian Inference on Visual Grammars by Neural Nets that Optimize”, Technical
Report YALEU/DCS/TR854, May 1991:

We  exhibit  a  systematic  way  to  derive  neural  nets  for  vision  problems.  It  involves  formulating  a 
vision  problem  as  Bayesian  inference  or  decision  on  a  comprehensive  model  of  the  visual  domain  given 
by  a  probabilistic  grammar.  A  key  feature  of  this  grammar  is  the  way  in  which  it  eliminates  model 
information, such as object labels, as it produces an image; correspondence problems and other noise
removal tasks result. The neural nets that arise most directly are generalized assignment networks.
Also there are transformations which naturally yield improved algorithms such as correlation matching
in scale space and the Frameville neural nets for high-level vision. Networks derived this way generally
have objective functions with spurious local minima; such minima may commonly be avoided by
dynamics that include deterministic annealing, for example recent improvements to Mean Field Theory
dynamics.  The  grammatical  method  of  neural  net  design  allows  domain  knowledge  to  enter  from  all 
levels  of  the  grammar,  including  “abstract”  levels  remote  from  the  final  image  data,  and  may  permit 
new  kinds  of  learning  as  well. 

Eric Mjolsness and Charles Garrett, “Algebraic Transformations of Objective Functions”, Neural
Networks, vol. 3, pp. 651-669, 1990:

Many neural networks can be derived as optimization dynamics for suitable objective functions. We
show that such networks can be designed by repeated transformations of one objective into another
with the same fixpoints. We exhibit a collection of algebraic transformations which reduce network
cost and increase the set of objective functions that are neurally implementable. The transformations
include simplification of products of expressions, functions of one or two expressions, and sparse
matrix products (all of which may be interpreted as Legendre transformations); also the minimum and
maximum of a set of expressions. These transformations introduce new interneurons which force the
network to seek a saddle point rather than a minimum. Other transformations allow control of the
network dynamics, by reconciling the Lagrangian formalism with the need for fixpoints. We apply
the transformations to simplify a number of structured neural networks, beginning with the standard
reduction of the winner-take-all network from O(N^2) connections to O(N). Also susceptible are
inexact graph-matching, random dot matching, convolutions and coordinate transformations, and sorting.
Simulations show that fixpoint-preserving transformations may be applied repeatedly and elaborately,
and the example networks still robustly converge.

Eric  Mjolsness,  Charles  Garrett,  and  Willard  L.  Miranker,  “Multiscale  Optimization  in  Neural  Nets”, 
IEEE  Transactions  on  Neural  Networks,  vol.  2,  no.  2,  March  1991: 

One way to speed up convergence in a large optimization problem is to introduce a smaller, approximate
version of the problem at a coarser scale and to alternate between relaxation steps for the fine-scale
and  the  coarse-scale  problems.  We  exhibit  such  an  optimization  method  for  neural  networks  governed 
by  quite  general  objective  functions.  At  the  coarse  scale  there  is  a  smaller  approximating  neural  net 
which,  like  the  original  net,  is  nonlinear  and  has  a  nonquadratic  objective  function.  The  transitions  and 
information  flow  from  fine  to  coarse  scale  and  back  do  not  disrupt  the  optimization,  and  the  user  need 
only  specify  a  partition  of  the  original  fine-scale  variables.  Thus  the  method  can  be  applied  easily  to 
many  problems  and  networks.  We  show  positive  experimental  results  including  cost  comparisons. 

P. Anandan, Stanley Letovsky, and Eric Mjolsness, “Connectionist Variable-Binding By Optimization”,
August 1989 Cognitive Science conference proceedings:

Symbolic AI systems based on logical or frame languages can easily perform inferences that are still
beyond the capabilities of most connectionist networks. This paper presents a strategy for implementing
in connectionist networks the basic mechanisms of variable binding, dynamic frame allocation and
equality that underlie many of the types of inferences commonly handled by frame systems, including
inheritance, subsumption and abductive inference. The paper describes a scheme for translating frame
definitions in a simple frame language into objective functions whose minima correspond to partial
deductive closures of the legal inferences. The resulting constrained optimization problem can be viewed
as  a  specification  for  a  connectionist  network. 

Eric Mjolsness, David H. Sharp, and John Reinitz, “A Connectionist Model of Development”, Journal
of Theoretical Biology, vol. 152, pp. 429-453, 1991:

We present a phenomenological modeling framework for development. Our purpose is to provide a
systematic method for discovering and expressing correlations in experimental data on gene expression
and other developmental processes. The modeling framework is based on a connectionist or “neural
net” dynamics for biochemical regulators, coupled to “grammatical rules” which describe certain
features of the birth, growth, and death of cells, synapses and other biological entities. We outline how
spatial geometry can be included, although this part of the model is not complete. As an example of
the application of our results to a specific biological system, we show in detail how to derive a rigorously
testable model of the network of segmentation genes operating in the blastoderm of Drosophila. To
further illustrate our methods, we sketch how they could be applied to two other important developmental
processes:  cell  cycle  control  and  cell-cell  induction.  We  also  present  a  simple  biochemical  model  leading 
to  our  assumed  connectionist  dynamics  which  shows  that  the  dynamics  used  is  at  least  compatible  with 
known  chemical  mechanisms. 
A Appendix: Mathematica Code for Algebraic Transformations

Following  the  example  of  section  2.2,  we  present  some  more  extensive  Mathematica  code  (by  C.  Garrett) 
which  uses  symbolic  algebra  to  implement  the  objective  function  transformations  of  [2]  (included  as 
Appendix  B).  This  code  is  the  start  of  the  front-end  portion  of  a  neural  network  compiler. 

The  C++  back  end  has  also  been  started  but  obviously  it  would  be  too  long  to  include  as  an 
appendix.  To  make  the  present  appendix  self-contained,  we  demonstrate  the  syntax  appropriate  for 
optimizing  the  objective  functions  entirely  within  Mathematica,  as  well  as  that  for  using  the  much  more 
efficient  back  end. 

File  “nemesis-examples” 

A  Description  of  Nemesis  for  Mathematica 
7/28/92 


The Nemesis.m package contains Mathematica code for constructing and
transforming objective functions.  We will describe how the code works,
focusing on several examples.  The description breaks down into 3
major parts: index domains, objective function transformations,
and how to run networks.


Index  Domains 

An index domain is a set of values that an index may have.  For our
purposes, an index domain will be either an interval, or a disjoint
union or cross product of 2 other domains.  Here are some examples of
valid and invalid index domains, in Mathematica-ese and English.

IndexDomain[1, 5]   A single index with values from 1 to 5, inclusive.

IndexDomain[-2, n]  A single index with values from -2 to n.

DisjointUnion[IndexDomain[0, 10], IndexDomain[20, 30]]

                    A single index with values from 0 to 10 and from 20 to
                    30.

DisjointUnion[IndexDomain[1, 5], IndexDomain[1, 5]]

                    A single index which takes on values from 1 to 5 twice.

CrossProductDomain[IndexDomain[1, 5], IndexDomain[1, 4]]

                    A two part index which takes on values from {1, 1} to
                    {5, 4}.

If  these  function  names  seem  too  long,  remember  that  you  can  use  Mathematica 
to  equate  shorter  names  to  these,  for  instance  CPD  =  CrossProductDomain. 

We  use  long  names  simply  to  be  unambiguous. 
That accounts for the representation of index domains, but how are they
used?  The functions SumOver and TableOf interpret the index domains.  For
example SumOver can break down a complicated index domain into a sum
over intervals like this:

In[4]:= BIGD = CrossProductDomain[DisjointUnion[IndexDomain[1, 5],
                                                IndexDomain[11, 15]],
                                  DisjointUnion[IndexDomain[6, 10],
                                                IndexDomain[16, 20]]]

In[5]:= SumOver[x[i][j], {i, j, BIGD}]

Out[5]= SumOver[x[i][j], {i, 1, 5}, {j, 6, 10}] +

>    SumOver[x[i][j], {i, 1, 5}, {j, 16, 20}] +

>    SumOver[x[i][j], {i, 11, 15}, {j, 6, 10}] +

>    SumOver[x[i][j], {i, 11, 15}, {j, 16, 20}]


Here  you  can  see  that  the  domain  BIGD  has  4  separate  parts,  each  of  which 
has  2  dimensions,  and  the  SumOver  function  breaks  the  domain  down  into 
its  separate  parts. 
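
The breakdown above can be mimicked outside Mathematica.  The following is
a minimal sketch in Python, not part of Nemesis: the function names are
hypothetical stand-ins for the Nemesis constructors, and a domain is
assumed to be stored simply as a list of rectangular pieces, each piece
being a tuple of inclusive (low, high) intervals.

```python
from itertools import product

# Hypothetical Python stand-ins for the Nemesis domain constructors.
# A domain is a list of rectangular pieces; a piece is a tuple of
# inclusive (low, high) intervals, one per index dimension.

def index_domain(lo, hi):
    """An interval domain: one piece with one dimension."""
    return [((lo, hi),)]

def disjoint_union(d1, d2):
    """Concatenate the pieces; values present in both are kept twice."""
    return d1 + d2

def cross_product_domain(d1, d2):
    """Pair every piece of d1 with every piece of d2."""
    return [p + q for p, q in product(d1, d2)]

def sum_over(f, domain):
    """Sum f over every index tuple, one rectangular piece at a time."""
    total = 0
    for piece in domain:
        ranges = [range(lo, hi + 1) for lo, hi in piece]
        total += sum(f(*idx) for idx in product(*ranges))
    return total

bigd = cross_product_domain(
    disjoint_union(index_domain(1, 5), index_domain(11, 15)),
    disjoint_union(index_domain(6, 10), index_domain(16, 20)),
)
for piece in bigd:
    print(piece)                        # the 4 rectangular parts of BIGD
print(sum_over(lambda i, j: 1, bigd))   # number of index pairs in BIGD
```

Storing a domain as a flat list of pieces makes the cross product of two
unions distribute automatically, which is the same effect SumOver shows
when it splits the sum over BIGD into four terms.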


Objective  Function  Transformations 

Objective function transformations take a part of an expression and
replace it with a new expression that shares the same fixed points but has
fewer multiplications.  In the process we introduce new variables, which
may act to minimize or maximize the objective function.  The transformations
are invoked by applying one of the transform functions to an objective,
like this:


In[2]:= SquareTransform[SumOver[x[i], {i, 5}]^2]

               2
Out[2]= -sigma1  + 2 sigma1 SumOver[x[i], {i, 5}]

In[3]:= MultiplyTransform[(a + b) (c + d)]

              2         2
        omega1    sigma2
Out[3]= ------- - ------- + (c + d) (-omega1 + sigma2) +
           2         2

                                   2
                               tau1
>    (a + b) (sigma2 - tau1) + -----
                                 2

The  new  variables  always  have  unique  names,  so  that  they  do  not  conflict 
with  other  variables  in  the  objective  function.  Also  the  names  of  the 
variables  are  placed  into  lists  called  ReversedNeurons  and  FastNeurons, 
so  that  you  can  tell  the  optimizers  which  way  to  move  them. 

In[4]:= ReversedNeurons

Out[4]= {sigma1, sigma2}

In[5]:= FastNeurons

Out[5]= {tau1, omega1}
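
To see why these replacements keep the same fixed points, it helps to
evaluate them numerically.  The sketch below is in Python rather than
Mathematica and is independent of Nemesis; the function names and the test
values of x and y are illustrative assumptions.  It checks that the square
transform recovers x^2 when sigma sits at its maximizing value, and that
the multiply transform recovers x y at its stationary point.

```python
def square_transformed(x, sigma):
    """2 x sigma - sigma^2, the replacement for x^2."""
    return 2 * x * sigma - sigma ** 2

def multiply_transformed(x, y, sigma, tau, omega):
    """Replacement for the product x y, with three new variables."""
    return (x * (sigma - tau) + y * (sigma - omega)
            - sigma ** 2 / 2 + tau ** 2 / 2 + omega ** 2 / 2)

x, y = 1.5, -0.7   # arbitrary illustrative values

# Square transform: stationary at sigma = x, and that point is a maximum,
# which is why sigma is listed among the reversed neurons.
sq_at_saddle = square_transformed(x, x)
sq_nearby = [square_transformed(x, x + d) for d in (-0.1, 0.1)]

# Multiply transform: stationary at tau = x, omega = y, sigma = x + y.
# tau and omega minimize, sigma maximizes.
mul_at_saddle = multiply_transformed(x, y, x + y, x, y)
mul_tau_moved = multiply_transformed(x, y, x + y, x + 0.1, y)
mul_sigma_moved = multiply_transformed(x, y, x + y + 0.1, x, y)

print(sq_at_saddle, x ** 2)
print(mul_at_saddle, x * y)
```

Moving tau away from its stationary value raises the transformed
objective and moving sigma away lowers it, which matches the split between
FastNeurons and ReversedNeurons.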

Running Networks (within Mathematica)

We have written some optimizers which do not require any back end.

They are defined in the file Descent.m.  The next section
is a brief demonstration of how you can use these optimizers.

First, we start up Mathematica, and read in the package Descent.m.
Nemesis.m will automatically be loaded when we read in Descent.m.

In[1]:= <<Descent.m

Descent.m defines two functions, GradDesc and SaddlePoint.  GradDesc
takes an objective function as its argument.  It gives each variable
in the objective a random initial value, and then performs gradient
descent to find a local minimum of the objective.  Here is a simple
example:

In[2]:= GradDesc[a^2]

{0.21235}

{0.208103}

{0.203941}

... ( many lines of output are skipped here )

{0.0038888}

{0.00381102}

{0.0037348}

Out[2]= {a -> 0.0037348}

GradDesc has driven the variable 'a' close to its minimum value of 0.
Remember that if you run the same network, you will see a slightly different
result due to the random initial conditions.  The final output of GradDesc is
a rule which you may use to replace the symbol 'a' by its nearly optimal
value.  In order to calculate the value of the objective function, you can say

In[3]:= a^2 /. Out[2]

Out[3]= 0.0000139487
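
The trace printed above shrinks by a factor of about 0.98 per step, which
is what plain gradient descent on a^2 does with a step size of 0.01.  The
sketch below is written in Python, not Mathematica, and is not the
Descent.m code: the step size of 0.01 and the count of 200 updates are
inferred from the printed numbers, not stated in the description, and the
function name is hypothetical.

```python
def grad_desc(grad, a0, step=0.01, n_steps=200):
    """Plain gradient descent on one variable; returns the whole trace.
    step and n_steps are assumptions inferred from the printed trace."""
    trace = [a0]
    for _ in range(n_steps):
        trace.append(trace[-1] - step * grad(trace[-1]))
    return trace

# The objective a^2 has gradient 2 a.  Start from the first printed value.
trace = grad_desc(lambda a: 2 * a, a0=0.21235)
a_final = trace[-1]
print(trace[1], trace[2])      # second and third iterates
print(a_final, a_final ** 2)   # final iterate and objective value
```

Under these assumptions the second and third iterates, the final value of
a, and the objective value should agree with the session output to the
printed precision.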

Now let's try a more complicated objective function which is defined
in Nemesis.m.  Here is the definition of Match.

************ code from Nemesis.m *******

Match[x_, m_, n_] :=
    SumOver[(SumOver[x[i][j], {j, n}] - 1)^2, {i, m}] +
    SumOver[(SumOver[x[i][j], {i, m}] - 1)^2, {j, n}] +
    SumOver[x[i][j] (1 - x[i][j]), {i, m}, {j, n}] +
    SumOver[Potential[x[i][j]], {i, m}, {j, n}]

************ end of code ****************

Match returns an objective function which implements soft
winner-take-all constraints on the rows and columns of a rectangular
matrix.  We can pass this directly to the GradDesc function.  Since
this network is fairly large, we will let it run for 500 steps.

In[6]:= GradDesc[Match[M, 5, 5], 500]

{0.282214, 0.268753, 0.212327, 0.231943, 0.274453, 0.279035, 0.146517,

>    0.256717, 0.196147, 0.137234, 0.231377, 0.225728, 0.246399, 0.225022,

>    0.105801, 0.213226, 0.278814, 0.15593, 0.235356, 0.275095, 0.295193,

>    0.1642, 0.27119, 0.219052, 0.146877}

( many steps removed )

{0.0424614, 0.0471485, 0.0439794, 0.040272, 0.866579, 0.0424675, 0.0471054,

>    0.868152, 0.039981, 0.043376, 0.0391433, 0.0425031, 0.0399867, 0.916185,

>    0.0400393, 0.045329, 0.832142, 0.04626, 0.0421782, 0.0469582, 0.884906,

>    0.0447747, 0.0422182, 0.0390809, 0.0423163}

Out[6]= {M[1][1] -> 0.0424614, M[1][2] -> 0.0471485, M[1][3] -> 0.0439794,

>    M[1][4] -> 0.040272, M[1][5] -> 0.866579, M[2][1] -> 0.0424675,

>    M[2][2] -> 0.0471054, M[2][3] -> 0.868152, M[2][4] -> 0.039981,

>    M[2][5] -> 0.043376, M[3][1] -> 0.0391433, M[3][2] -> 0.0425031,

>    M[3][3] -> 0.0399867, M[3][4] -> 0.916185, M[3][5] -> 0.0400393,

>    M[4][1] -> 0.045329, M[4][2] -> 0.832142, M[4][3] -> 0.04626,

>    M[4][4] -> 0.0421782, M[4][5] -> 0.0469582, M[5][1] -> 0.884906,

>    M[5][2] -> 0.0447747, M[5][3] -> 0.0422182, M[5][4] -> 0.0390809,

>    M[5][5] -> 0.0423163}

The  final  answer  is  close  to  a  permutation  network.  You  can  see  the 
structure  more  clearly  by  printing  out  a  table  of  M’s  in  MatrixForm. 

In[7]:= MatrixForm[Table[M[i][j], {i, 5}, {j, 5}] /. Out[6]]

Out[7]//MatrixForm= 0.0424614   0.0471485   0.0439794   0.040272    0.866579

                    0.0424675   0.0471054   0.868152    0.039981    0.043376

                    0.0391433   0.0425031   0.0399867   0.916185    0.0400393

                    0.045329    0.832142    0.04626     0.0421782   0.0469582

                    0.884906    0.0447747   0.0422182   0.0390809   0.0423163

Now, we will try using an objective function transformation
on the Match network.  As a first step, let's look at the structure
of the expression generated by Match[M, 5, 5].

In[9]:= func = Match[M, 5, 5]

                                               2
Out[9]= SumOver[(-1 + SumOver[M[i][j], {i, 5}]) , {j, 5}] +

                                            2
>    SumOver[(-1 + SumOver[M[i][j], {j, 5}]) , {i, 5}] +

>    SumOver[Potential[M[i][j]], {i, 5}, {j, 5}] +

>    SumOver[(1 - M[i][j]) M[i][j], {i, 5}, {j, 5}]

We can use the function SquareTransform to produce a new objective function
with terms of the form x^2 transformed into 2 x sigma - sigma^2.

In[11]:= func = SquareTransform[func]

                           2
Out[11]= SumOver[-sigma1[j]  + 2 sigma1[j] (-1 + SumOver[M[i][j], {i, 5}]),

                                 2
>    {j, 5}] + SumOver[-sigma2[i]  +

>    2 sigma2[i] (-1 + SumOver[M[i][j], {j, 5}]), {i, 5}] +

>    SumOver[0.1 (-Abs[-0.5 + M[i][j]] - 0.5 Log[0.5 - Abs[-0.5 + M[i][j]]]),

>    {i, 5}, {j, 5}] + SumOver[(1 - M[i][j]) M[i][j], {i, 5}, {j, 5}]

In this objective function, sigma1[j] governs the column constraints,
and sigma2[i] governs the row constraints.  Since the new sigma neurons act
to maximize the objective function, we cannot use the same GradDesc
optimizer which sends all neurons downhill.  Instead we use the optimizer
SaddlePoint, and pass it a list of the reversed neurons.  Conveniently,
the SquareTransform function appends the names of all the new reversed
neurons it creates to the list ReversedNeurons.

In[12]:= SaddlePoint[func, ReversedNeurons, 500]

{0.112055, 0.287099, 0.235514, 0.23606, 0.129563, 0.110942, 0.282688,

>    0.16964, 0.175481, 0.175472, 0.259398, 0.170251, 0.282214, 0.268753,

>    0.206656, 0.275856, 0.106098, 0.282597, 0.21139, 0.232409, 0.26157,

>    0.135819, 0.10515, 0.21158, 0.104505, 0.199769, 0.105801, 0.225568,

>    0.165773, 0.141062, 0.2621, 0.152875, 0.289367, 0.245156, 0.26151}

( many steps removed )

{0.0404085, 0.0404085, 0.0452528, 0.881893, 0.0404189, 0.0374939, 0.919705,

>    0.0408759, 0.0403384, 0.0375015, 0.0408049, 0.0408049, 0.875959,

>    0.0453532, 0.0408156, 0.0375015, 0.0375015, 0.0408868, 0.0403486,

>    0.919624, 0.919705, 0.0374939, 0.0408758, 0.0403384, 0.0375015, 0.076609,

>    0.0766089, 0.0343611, 0.0401126, 0.076517, 0.0398592, 0.076615,

>    0.0345962, 0.0765232, 0.0766152}

Out[12]= {M[1][1] -> 0.0404085, M[1][2] -> 0.0404085, M[1][3] -> 0.0452528,

>    M[1][4] -> 0.881893, M[1][5] -> 0.0404189, M[2][1] -> 0.0374939,

>    M[2][2] -> 0.919705, M[2][3] -> 0.0408759, M[2][4] -> 0.0403384,

>    M[2][5] -> 0.0375015, M[3][1] -> 0.0408049, M[3][2] -> 0.0408049,

>    M[3][3] -> 0.875959, M[3][4] -> 0.0453532, M[3][5] -> 0.0408156,

>    M[4][1] -> 0.0375015, M[4][2] -> 0.0375015, M[4][3] -> 0.0408868,

>    M[4][4] -> 0.0403486, M[4][5] -> 0.919624, M[5][1] -> 0.919705,

>    M[5][2] -> 0.0374939, M[5][3] -> 0.0408758, M[5][4] -> 0.0403384,

>    M[5][5] -> 0.0375015, sigma1[1] -> 0.076609, sigma1[2] -> 0.0766089,

>    sigma1[3] -> 0.0343611, sigma1[4] -> 0.0401126, sigma1[5] -> 0.076517,

>    sigma2[1] -> 0.0398592, sigma2[2] -> 0.076615, sigma2[3] -> 0.0345962,

>    sigma2[4] -> 0.0765232, sigma2[5] -> 0.0766152}

In[13]:= MatrixForm[Table[M[i][j], {i, 5}, {j, 5}] /. Out[12]]

Out[13]//MatrixForm= 0.0404085   0.0404085   0.0452528   0.881893    0.0404189

                     0.0374939   0.919705    0.0408759   0.0403384   0.0375015

                     0.0408049   0.0408049   0.875959    0.0453532   0.0408156

                     0.0375015   0.0375015   0.0408868   0.0403486   0.919624

                     0.919705    0.0374939   0.0408758   0.0403384   0.0375015

This  answer  is  also  a  good  permutation  matrix. 
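
The idea behind a saddle point optimizer can be shown on a much smaller
problem.  The following Python sketch is not the SaddlePoint function from
Descent.m: the two-variable objective, the step size, the starting values
and the function name are all illustrative assumptions.  It takes the
objective (x1 + x2 - 1)^2 + (x1 - x2)^2, applies the square transform to
the first term only, and then moves x1 and x2 downhill while moving the
reversed variable sigma uphill.

```python
def saddle_point(x1, x2, sigma, step=0.05, n_steps=500):
    """Descend in x1, x2 and ascend in the reversed variable sigma on
    L = 2 sigma (x1 + x2 - 1) - sigma^2 + (x1 - x2)^2,
    the square-transformed form of (x1 + x2 - 1)^2 + (x1 - x2)^2."""
    for _ in range(n_steps):
        g1 = 2 * sigma + 2 * (x1 - x2)        # dL/dx1
        g2 = 2 * sigma - 2 * (x1 - x2)        # dL/dx2
        gs = 2 * (x1 + x2 - 1) - 2 * sigma    # dL/dsigma
        x1, x2 = x1 - step * g1, x2 - step * g2   # ordinary neurons go downhill
        sigma = sigma + step * gs                  # reversed neuron goes uphill
    return x1, x2, sigma

x1, x2, sigma = saddle_point(0.9, 0.2, 0.3)
print(x1, x2, sigma)
```

The untransformed objective has its minimum at x1 = x2 = 0.5, and the
transformed one has its saddle point there with sigma = 0, so both
formulations share the same fixed point, as the transformation is meant
to guarantee.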


Running  Networks  (with  the  back  end) 

In  order  to  use  the  C++  back  end,  you  must  install  the  routines  which 
are  in  the  file  MathTree.  These  routines  allow  you  to  describe  networks 
and  optimizers,  define  reversed  or  fast  neurons,  initialize  neuron 
values,  and  finally  run  networks. 

In[14]:= Install["MathTree"]

Out[14]= LinkObject[MathTree, 1, 1]

We will continue with the permutation matrix example network that we used
above, and this time we will write the objective function using an index
domain.  The new function is called MatchDomain.

In[31]:= MatchDomain[x_, id_] :=
    SumOver[(SumOver[x[i][j], {j, Dimension[2, id]}] - 1)^2,
            {i, Dimension[1, id]}] +
    SumOver[(SumOver[x[i][j], {i, Dimension[1, id]}] - 1)^2,
            {j, Dimension[2, id]}] +
    SumOver[x[i][j] (1 - x[i][j]), {i, j, id}] +
    SumOver[Potential[x[i][j]], {i, j, id}]

This  function  will  take  an  index  domain  with  2  dimensions,  and  impose 
winner-take-all  constraints  along  each  dimension.  Now  let’s  define 
an  index  domain  and  call  MatchDomain  to  see  what  the  objective  function 
will  look  like. 

In[32]:= CPD = CrossProductDomain[IndexDomain[1, 5], IndexDomain[1, 5]]

Out[32]= CrossProductDomain[IndexDomain[1, 5], IndexDomain[1, 5]]

In[33]:= MatchDomain[M, CPD]

                                                   2
Out[33]= SumOver[(-1 + SumOver[M[i][j], {i, 1, 5}]) , {j, 1, 5}] +

                                               2
>    SumOver[(-1 + SumOver[M[i][j], {j, 1, 5}]) , {i, 1, 5}] +

>    SumOver[Potential[M[i][j]], {i, 1, 5}, {j, 1, 5}] +

>    SumOver[(1 - M[i][j]) M[i][j], {i, 1, 5}, {j, 1, 5}]


We now communicate the objective function to the back end program with the
function SetUpNet, and choose a line minimization optimizer with
SetUpOptimizer.

In[34]:= SetUpNet[MatchDomain[M, CPD]]

Variable M:

0.088342  0.113757  0.027181  0.094360  0.036566
0.194903  0.111318  0.085666  0.111402  0.103520
0.039403  0.137246  0.015866  0.092842  0.034000
0.024054  0.035861  0.139221  0.090609  0.033407
0.037842  0.016067  0.087616  0.051261  0.122438

Out[34]= 1

In[35]:= SetUpOptimizer[lm]

Using line minimizer

Out[35]= 1

Now  we  are  ready  to  run  the  network  with  RunNet. 

In[36]:= RunNet[]

Variable  M: 

0.179168  0.202559  0.130800  0.179705  0.137766 
0.267729  0.175160  0.154529  0.172607  0.175494 
0.138263  0.231581  0.138975  0.182257  0.140355 
0.131180  0.132569  0.239219  0.179584  0.139649 
0.137620  0.133204  0.185878  0.142404  0.224744 

energy  =  4.084525 

( many steps of output removed )

energy  =  1.384581 

Variable  M: 

0.011132  0.011135  0.011132  0.985656  0.011132 
0.985724  0.011134  0.011131  0.011133  0.011131 
0.011129  0.985723  0.011129  0.011132  0.011129 
0.011131  0.011134  0.985724  0.011133  0.011131 
0.011131  0.011134  0.011131  0.011133  0.985724 


Out[36]= {M[1][1] -> 0.0111322, M[1][2] -> 0.0111351, M[1][3] -> 0.0111322,

>    M[1][4] -> 0.985656, M[1][5] -> 0.0111322, M[2][1] -> 0.985724,

>    M[2][2] -> 0.0111339, M[2][3] -> 0.0111306, M[2][4] -> 0.0111333,

>    M[2][5] -> 0.0111306, M[3][1] -> 0.0111288, M[3][2] -> 0.985723,

>    M[3][3] -> 0.0111288, M[3][4] -> 0.0111316, M[3][5] -> 0.0111288,

>    M[4][1] -> 0.0111306, M[4][2] -> 0.0111339, M[4][3] -> 0.985724,

>    M[4][4] -> 0.0111333, M[4][5] -> 0.0111306, M[5][1] -> 0.0111306,

>    M[5][2] -> 0.0111339, M[5][3] -> 0.0111306, M[5][4] -> 0.0111333,

>    M[5][5] -> 0.985724}

The  answer  is  a  permutation  matrix,  as  we  expected.  And  now  just  to 
complete  the  demonstration,  we  will  run  the  permutation  matrix  network  with 
reversed  neurons. 


In[37]:= func = SquareTransform[MatchDomain[M, CPD]]

                           2
Out[37]= SumOver[-sigma1[j]  +

>    2 sigma1[j] (-1 + SumOver[M[i][j], {i, 1, 5}]), {j, 1, 5}] +

                       2
>    SumOver[-sigma2[i]  + 2 sigma2[i] (-1 + SumOver[M[i][j], {j, 1, 5}]),

>    {i, 1, 5}] + SumOver[Potential[M[i][j]], {i, 1, 5}, {j, 1, 5}] +

>    SumOver[(1 - M[i][j]) M[i][j], {i, 1, 5}, {j, 1, 5}]

In[38]:= SetUpNet[func]

Variable sigma2:

-0.233151  0.275137  -1.456376  -0.112794  -1.268687
Variable M:

0.194903  0.111318  0.085666  0.111402  0.103520
0.039403  0.137246  0.015866  0.092842  0.034000
0.024054  0.035861  0.139221  0.090609  0.033407
0.037842  0.016067  0.087616  0.051261  0.122438
0.155670  0.026563  0.064878  0.161136  0.069341

Variable sigma1:

0.558585  1.640131  -0.606858  -0.872527  -1.816334
Out[38]= 1

In[39]:=  SetUpOptimizer [lm] 

Using  line  minimizer 
Out [39]=  1 

In[40] :=  SetUpReversedNeurons[ReversedNeurons] 


Out [40]=  1 

In[41]:=  RunNet[] 

Variable  sigma2: 

-0.393191  -0.680643  -0.676848  -0.684777  -0.522411 
Variable  M: 

0.710856  0.678656  0.593971  0.537001  0.640001 
0.758736  0.942796  0.947946  0.738820  0.839069 
0.807958  0.859952  0.890754  0.733150  0.837591 
0.764348  0.999758  0.825588  0.710739  0.895874 
0.748781  0.769826  0.676887  0.713906  0.704011 
Variable sigma1:

-0.548127  -0.672945  -0.606754  -0.492750  -0.637294
energy = -33.651328

( after many steps )


energy  =  1.384579 


Variable  sigma2: 

0.029720  0.029719  0.029244  0.029719  0.029719 
Variable  M: 

0.011176  0.011173  0.011176  0.011176  0.985732 
0.011176  0.011174  0.985732  0.011176  0.011176 
0.011173  0.985541  0.011173  0.011173  0.011173 
0.985732  0.011174  0.011176  0.011176  0.011176 
0.011176  0.011174  0.011176  0.985732  0.011176 


Variable sigma1:

0.029719  0.029243  0.029719  0.029719  0.029719

Out[41]= {sigma2[1] -> 0.02972, sigma2[2] -> 0.0297189,

>    sigma2[3] -> 0.0292441, sigma2[4] -> 0.0297189, sigma2[5] -> 0.0297189,

>    M[1][1] -> 0.0111758, M[1][2] -> 0.0111735, M[1][3] -> 0.0111758,

>    M[1][4] -> 0.0111758, M[1][5] -> 0.985732, M[2][1] -> 0.0111759,

>    M[2][2] -> 0.0111736, M[2][3] -> 0.985732, M[2][4] -> 0.0111759,

>    M[2][5] -> 0.0111759, M[3][1] -> 0.0111735, M[3][2] -> 0.985541,

>    M[3][3] -> 0.0111735, M[3][4] -> 0.0111735, M[3][5] -> 0.0111735,

>    M[4][1] -> 0.985732, M[4][2] -> 0.0111736, M[4][3] -> 0.0111759,

>    M[4][4] -> 0.0111759, M[4][5] -> 0.0111759, M[5][1] -> 0.0111759,

>    M[5][2] -> 0.0111736, M[5][3] -> 0.0111759, M[5][4] -> 0.985732,

>    M[5][5] -> 0.0111759, sigma1[1] -> 0.0297194, sigma1[2] -> 0.0292435,

>    sigma1[3] -> 0.0297194, sigma1[4] -> 0.0297194, sigma1[5] -> 0.0297194}
File “Nemesis.m”


WTA::usage =
"WTA[x, n] returns a complete objective function for a winner
take all net, involving variables x[1] through x[n]."

Match::usage =
"Match[x, m, n] returns a complete objective function for a
permutation matrix, involving variables x[1][1] through x[m][n]."

SquareTransform::usage =
"SquareTransform[e] returns a copy of e, except that the terms
of the form x^2 are replaced by 2 x sigma - sigma^2."

MultiplyTransform::usage =
"MultiplyTransform[e] returns a copy of e, except that the terms
of the form x y are replaced by x (sigma - tau) + y (sigma - omega)
+ 1/2 (-sigma^2 + tau^2 + omega^2)."

MultiplyTransform2::usage =
"MultiplyTransform2[e] returns a copy of e, except that the terms
of the form x y are replaced by 1/2 x (sigma - tau) + 1/2 y (sigma + tau)
+ 1/4 (-sigma^2 + tau^2)."

ExponentialTransform::usage =
"ExponentialTransform[e] returns a copy of e, except that the terms
of the form Exp[x] are replaced by (x + 1) sigma + sigma Log[sigma]."

XLogXTransform::usage =
"XLogXTransform[e] returns a copy of e, except that the terms
of the form Abs[x] Log[Abs[x]] are replaced by Abs[x] sigma -
Exp[sigma]."

LogXTransform::usage =
"LogXTransform[e] returns a copy of e, except that the terms
of the form Log[Abs[x]] are replaced by x sigma - Log[Abs[sigma]]."

AbsPowerTransform::usage =
"AbsPowerTransform[e] returns a copy of e, except that the terms
of the form Abs[x]^p/p are replaced by x sigma -
Abs[sigma]^(p/(p-1)) ((p-1)/p)."

SumOver::usage =
"SumOver[e, {i, n}, {j, m}, ...] represents a sum over expression
e, with summation indices i, j, ... just like the built in Sum
function.  The difference is that Mathematica will not try to
expand SumOver the way it expands Sum."

TableOf::usage =
"TableOf[e, {i, n}, {j, m}, ...] represents a list of expressions
e, with indices i, j, ... just like the built in Table function.
The difference is that Mathematica will not try to expand TableOf
the way it expands Table."

IndexDomain::usage =
"IndexDomain[l, h] is a domain in which an index can take on
values l, l+1, l+2, ..., h."

DisjointUnion::usage =
"The DisjointUnion of 2 IndexDomains is a domain which
contains all of the index values from each domain, and repeats
those values which occur in both domains."

CrossProductDomain::usage =
"The CrossProductDomain of 2 IndexDomains is a domain which
contains one index value for each possible combination of index values
in the original domains."

ReversedNeurons::usage =
"The list ReversedNeurons contains the neurons introduced by objective
function transformations which act to maximize the objective."

FastNeurons::usage =
"The list FastNeurons contains the neurons introduced by objective
function transformations which act to minimize the objective."

(*  Some  basic  objective  functions  *) 


WTA[x_, n_] :=
    (SumOver[x[i], {i, n}] - 1)^2 +
    SumOver[x[i] (1 - x[i]), {i, n}] +
    SumOver[Potential[x[i], 0, 1], {i, n}]

Match[x_, m_, n_] :=
    SumOver[(SumOver[x[i][j], {j, n}] - 1)^2, {i, m}] +
    SumOver[(SumOver[x[i][j], {i, m}] - 1)^2, {j, n}] +
    SumOver[x[i][j] (1 - x[i][j]), {i, m}, {j, n}] +
    SumOver[Potential[x[i][j]], {i, m}, {j, n}]


(*  Objective  function  transformations  *) 

(* Notice the square transform only applies to non-Atomic expressions.
   Each transformation rule will be responsible for adding the neurons
   it creates to the lists ReversedNeurons or FastNeurons. *)

ReversedNeurons = {}

FastNeurons = {}

(* FreeIndices[x] returns the indices in expression x which are not
   bound by any SumOver expression in x itself. *)

FreeIndices[SumOver[x_, {i_, id_}]] :=
    Complement[FreeIndices[x], {i}]

FreeIndices[SumOver[x_, {i_, j_Integer, id_}]] :=
    Complement[FreeIndices[x], {i}]

FreeIndices[SumOver[x_, {i_, j_, id_}]] :=
    Complement[FreeIndices[x], {i, j}]

FreeIndices[SumOver[x_, {i_, j_, k_, id_}]] :=
    Complement[FreeIndices[x], {i, j, k}]

FreeIndices[SumOver[x_, {i_, j_, k_, l_, id_}]] :=
    Complement[FreeIndices[x], {i, j, k, l}]

FreeIndices[Plus[x_, y_]] :=
    Union[FreeIndices[x], FreeIndices[y]]

FreeIndices[Times[x_, y_]] :=
    Union[FreeIndices[x], FreeIndices[y]]

FreeIndices[Power[x_, y_]] :=
    Union[FreeIndices[x], FreeIndices[y]]

FreeIndices[Log[x_]] :=
    FreeIndices[x]

FreeIndices[Exp[x_]] :=
    FreeIndices[x]

FreeIndices[Abs[x_]] :=
    FreeIndices[x]

FreeIndices[Potential[x_]] :=
    FreeIndices[x]

FreeIndices[x_] :=
    Block[{},
        If[NumberQ[x] || Head[x] == Symbol, Return[{}], 0];
        If[Head[x[[1]]] == Symbol, Return[Union[FreeIndices[Head[x]],
                                                {x[[1]]}]], 0];
        FreeIndices[Head[x]]
    ]

AddIndices[s_, li_] :=
  Block[{},
    If[Length[li] == 0, Return[s], 0];
    AddIndices[s[ li[[1]] ], Rest[li]]
  ]

SquareTransformRule = Power[x_, 2] :>
  Block[{sig = Unique["sigma"], fi = FreeIndices[x]},
    AppendTo[ReversedNeurons, sig];
    2 x AddIndices[sig, fi] -
    AddIndices[sig, fi]^2] /; !AtomQ[x]

SquareTransform[e_] :=
  e /. SquareTransformRule

(* The multiplication transformation will only be applied when neither of
   the two expressions are simply numbers, and when neither contains any
   reversed neurons. The transformation introduces 3 new neurons, sigma,
   tau and omega. *)

MultiplyTransformRule = Times[x_, y_] :>
  Block[{sigma = Unique["sigma"], tau = Unique["tau"],
         omega = Unique["omega"], fi = FreeIndices[x]},
    AppendTo[ReversedNeurons, sigma];
    AppendTo[FastNeurons, tau];
    AppendTo[FastNeurons, omega];
    x (AddIndices[sigma, fi] - AddIndices[tau, fi]) +
    y (AddIndices[sigma, fi] - AddIndices[omega, fi]) -
    AddIndices[sigma, fi]^2/2 + AddIndices[tau, fi]^2/2 +
    AddIndices[omega, fi]^2/2] /;
  FreeReversedQ[x] && FreeReversedQ[y] && !NumberQ[x] && !NumberQ[y]

(* FreeReversedQ returns True if the expression x does not contain any of
   the symbols on the list ReversedNeurons. It returns False otherwise *)
FreeReversedQ[x_] := Block[{i},
  For[i=1, i<=Length[ReversedNeurons], i++, If[FreeQ[x,
    ReversedNeurons[[i]]], , Return[False]]];
  Return[True]]

MultiplyTransform[e_] :=
  e /. MultiplyTransformRule

(* The second multiplication transformation just introduces 2 new variables. *)

MultiplyTransformRule2 = Times[x_, y_] :>
  Block[{sigma = Unique["sigma"], tau = Unique["tau"],
         fi = FreeIndices[x]},
    AppendTo[ReversedNeurons, sigma];
    AppendTo[FastNeurons, tau];
    x (AddIndices[sigma, fi] - AddIndices[tau, fi])/2 +
    y (AddIndices[sigma, fi] + AddIndices[tau, fi])/2
    - AddIndices[sigma, fi]^2/4 + AddIndices[tau, fi]^2/4] /;
  FreeReversedQ[x] && FreeReversedQ[y] && !NumberQ[x] && !NumberQ[y]

(* Apply the two-neuron rule defined above (MultiplyTransformRule2). *)
MultiplyTransform2[e_] :=
  e /. MultiplyTransformRule2

ExponentialTransformRule = Power[E, x_] :>
  Block[{sigma = Unique["sigma"], fi = FreeIndices[x]},
    AppendTo[ReversedNeurons, sigma];
    (x + 1) AddIndices[sigma, fi] -
    AddIndices[sigma, fi] Log[AddIndices[sigma, fi]]]

ExponentialTransform[e_] :=
  e /. ExponentialTransformRule

XLogXTransformRule = Abs[x_] Log[Abs[x_]] :>
  Block[{sigma = Unique["sigma"], fi = FreeIndices[x]},
    AppendTo[ReversedNeurons, sigma];
    Abs[x] AddIndices[sigma, fi] - Exp[AddIndices[sigma, fi]]]

XLogXTransform[e_] :=
  e /. XLogXTransformRule

LogXTransformRule = Log[Abs[x_]] :>
  Block[{sigma = Unique["sigma"], fi = FreeIndices[x]},
    AppendTo[FastNeurons, sigma];
    x AddIndices[sigma, fi] - Log[Abs[AddIndices[sigma, fi]]]]

LogXTransform[e_] :=
  e /. LogXTransformRule

AbsPowerTransformRule = Abs[x_]^p_/p_ :>
  Block[{sigma = Unique["sigma"], fi = FreeIndices[x]},
    AppendTo[FastNeurons, sigma];
    x AddIndices[sigma, fi] -
    Abs[AddIndices[sigma, fi]]^(p/(p-1)) ((p-1)/p)] /; FreeReversedQ[x]

AbsPowerTransform[e_] :=
  e /. AbsPowerTransformRule


(*  Index  domain  notation  *) 


Parts[IndexDomain[l_, h_]] := 1

Parts[CrossProductDomain[l_, r_]] := Parts[l] * Parts[r]

Parts[DisjointUnion[l_, r_]] := Parts[l] + Parts[r]

Size[IndexDomain[l_, h_]] := 1

Size[CrossProductDomain[l_, r_]] := Size[l] + Size[r]

Size[DisjointUnion[l_, r_]] := If[Size[l] == Size[r], Size[l], 0]

Dimension[n_, IndexDomain[l_, h_]] := If[n == 1, IndexDomain[l, h], 0]
Dimension[n_, CrossProductDomain[l_, r_]] :=
  If[n <= Size[l], Dimension[n, l], Dimension[n - Size[l], r]]

Dimension[n_, DisjointUnion[l_, r_]] :=
  DisjointUnion[Dimension[n, l], Dimension[n, r]]

Start[part_, size_, IndexDomain[l_, h_]] := If[part == 1 && size == 1, l, 0]
Finish[part_, size_, IndexDomain[l_, h_]] := If[part == 1 && size == 1, h, 0]

Start[part_, size_, CrossProductDomain[l_, r_]] :=
  If[size <= Size[l], Start[Floor[(part-1) / Parts[r]] + 1, size, l],
     Start[Mod[(part - 1), Parts[r]] + 1, size - Size[l], r]]

Finish[part_, size_, CrossProductDomain[l_, r_]] :=
  If[size <= Size[l], Finish[Floor[(part-1) / Parts[r]] + 1, size, l],
     Finish[Mod[(part - 1), Parts[r]] + 1, size - Size[l], r]]

Start[part_, size_, DisjointUnion[l_, r_]] :=
  If[part <= Parts[l], Start[part, size, l],
     Start[part - Parts[l], size, r]]

Finish[part_, size_, DisjointUnion[l_, r_]] :=
  If[part <= Parts[l], Finish[part, size, l],
     Finish[part - Parts[l], size, r]]

DomainQ[id_] := Head[id] == IndexDomain || Head[id] == CrossProductDomain ||
  Head[id] == DisjointUnion

SumOver[f_, {i_, IndexDomain[l_, h_]}] := SumOver[f, {i, l, h}]

SumOver[f_, {i_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[SumOver[f, {i, Start[p, 1, id], Finish[p, 1, id]}],
    {p, Parts[id]}], Plus, 0]]

SumOver[f_, {i_, j_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[SumOver[f, {i, Start[p, 1, id], Finish[p, 1, id]},
                               {j, Start[p, 2, id], Finish[p, 2, id]}],
    {p, Parts[id]}], Plus, 0]]

SumOver[f_, {i_, j_, k_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[SumOver[f, {i, Start[p, 1, id], Finish[p, 1, id]},
                               {j, Start[p, 2, id], Finish[p, 2, id]},
                               {k, Start[p, 3, id], Finish[p, 3, id]}],
    {p, Parts[id]}], Plus, 0]]

SumOver[f_, {i_, j_, k_, l_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[SumOver[f, {i, Start[p, 1, id], Finish[p, 1, id]},
                               {j, Start[p, 2, id], Finish[p, 2, id]},
                               {k, Start[p, 3, id], Finish[p, 3, id]},
                               {l, Start[p, 4, id], Finish[p, 4, id]}],
    {p, Parts[id]}], Plus, 0]]

TableOf[f_, {i_, IndexDomain[l_, h_]}] := TableOf[f, {i, l, h}]

TableOf[f_, {i_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[TableOf[f, {i, Start[p, 1, id], Finish[p, 1, id]}],
    {p, Parts[id]}], Plus, 0]]

TableOf[f_, {i_, j_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[TableOf[f, {i, Start[p, 1, id], Finish[p, 1, id]},
                               {j, Start[p, 2, id], Finish[p, 2, id]}],
    {p, Parts[id]}], Plus, 0]]

TableOf[f_, {i_, j_, k_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[TableOf[f, {i, Start[p, 1, id], Finish[p, 1, id]},
                               {j, Start[p, 2, id], Finish[p, 2, id]},
                               {k, Start[p, 3, id], Finish[p, 3, id]}],
    {p, Parts[id]}], Plus, 0]]

TableOf[f_, {i_, j_, k_, l_, id_ /; DomainQ[id]}] := Module[{p},
  ReplacePart[Table[TableOf[f, {i, Start[p, 1, id], Finish[p, 1, id]},
                               {j, Start[p, 2, id], Finish[p, 2, id]},
                               {k, Start[p, 3, id], Finish[p, 3, id]},
                               {l, Start[p, 4, id], Finish[p, 4, id]}],
    {p, Parts[id]}], Plus, 0]]
File "Descent.m"

(Descent.m isn't needed in the presence of the back end.)


<<Nemesis.m


GradDesc::usage =

"GradDesc[f] minimizes the objective function f. The optimizer performs
gradient descent with a fixed step size for 200 steps."

SaddlePoint::usage =

"SaddlePoint[f, {r1, r2, r3, ...}] finds a saddle point of the objective
function f by minimizing it with respect to all variables except r1,
r2, r3, etc. The optimizer performs gradient descent/ascent with a
fixed step size for 200 steps."

ExtractSymbols[x_] :=
  Block[{n = Length[x], i},
    If[n == 0 && Head[x] == Symbol, Return[x], 0];
    If[n == 0 && Head[x] != Symbol, Return[{}], 0];
    If[n == 1, Return[ElimFunc[x]], 0];
    Union[ Flatten[ Table[ ExtractSymbols[ x[[i]] ], {i, n}] ] ]
  ]


ElimFunc[x_] :=
  Block[{},
    If[Head[x] == Abs, Return[{}], 0];
    If[Head[x] == Exp, Return[{}], 0];
    If[Head[x] == Log, Return[{}], 0];
    If[Head[x] == Tanh, Return[{}], 0];
    x
  ]


max  =  1 


min  =  0 


gain  =  50 

range := 0.5 * (max - min)
avg := 0.5 * (max + min)

Transfer[x_] := gain*x / (1 + Abs[gain*x/range]) + avg
InvTransfer[y_] := (y - avg) / (gain*(1 - Abs[(y - avg)/range]))

Potential[x_] := 10 range*(-range*Log[range - Abs[x-avg]] - Abs[x-avg])/gain

Abs' := Sign


GradDesc[f_] := InternalGradDesc[f /. SumOver -> Sum]

GradDesc[f_, nsteps_] := InternalGradDesc[f /. SumOver -> Sum, nsteps]

InternalGradDesc[f_, nsteps_] :=
  Block[{i, j, args = ExtractSymbols[f],
         len, table, rules = {}, dxs},
    len = Length[args];
    table = Table[0.1 + 0.2 Random[], {i, len}];
    Print[table];

    For[i=1, i<=len, i++,
      dxs[i] = D[f, args[[i]]];
      AppendTo[rules, args[[i]] -> table[[i]] ]
    ];

    For[j=0, j<nsteps, j++,
      For[i=1, i<=len, i++,
        table[[i]] = table[[i]] - 0.01 (dxs[i] /. rules);
        rules[[i]] = args[[i]] -> table[[i]]
      ];
      Print[j];
      Print[table];
    ];

    rules
  ]

FindReversed[all_, rev_] :=
  Flatten[ Map[ Function[ FindName[ #1, rev ] ], all] ]

FindName[x_, names_] :=
  If[MatchName[x, names], x, {}]

MatchName[x_, names_] :=
  If[MatchQ[Head[x], Symbol],
     MemberQ[names, x], MatchName[Head[x], names]]

SaddlePoint[f_, revnames_] :=
  InternalSaddlePoint[f /. SumOver -> Sum, revnames]

SaddlePoint[f_, revnames_, nsteps_] :=
  InternalSaddlePoint[f /. SumOver -> Sum, revnames, nsteps]

InternalSaddlePoint[f_, revnames_, nsteps_] :=
  Block[{i, j, args = ExtractSymbols[f], revargs, len, revlen,
         table, rules = {}, dxs, dxrs},
    revargs = FindReversed[args, revnames];
    args = Complement[args, revargs];
    len = Length[args];
    revlen = Length[revargs];

    table = Table[0.1 + 0.2 Random[], {i, len + revlen}];
    Print[table];

    (* derivatives with respect to the ordinary (minimizing) variables *)
    For[i=1, i<=len, i++,
      dxs[i] = D[f, args[[i]]];
      AppendTo[rules, args[[i]] -> table[[i]] ]
    ];

    (* derivatives with respect to the reversed (maximizing) variables *)
    For[i=1, i<=revlen, i++,
      dxrs[i] = D[f, revargs[[i]]];
      AppendTo[rules, revargs[[i]] -> table[[i+len]] ]
    ];

    For[j=0, j<nsteps, j++,
      For[i=1, i<=len, i++,
        table[[i]] = table[[i]] - 0.01 (dxs[i] /. rules);
        rules[[i]] = args[[i]] -> table[[i]]
      ];
      For[i=1, i<=revlen, i++,
        table[[i+len]] = table[[i+len]] + 0.1 (dxrs[i] /. rules);
        rules[[i+len]] = revargs[[i]] -> table[[i+len]]
      ];
      Print[j];
      Print[table];
    ];

    rules
  ]
B  Appendix:  Algebraic  Transformations  of  Objective  Functions

Abstract. Many neural networks can be derived as optimization dynamics
for suitable objective functions. We show that such networks can be
designed by repeated transformations of one objective into another with
the same fixpoints. We exhibit a collection of algebraic transformations
which reduce network cost and increase the set of objective functions
that are neurally implementable. The transformations include
simplification of products of expressions, functions of one or two
expressions, and sparse matrix products (all of which may be interpreted
as Legendre transformations); also the minimum and maximum of a set of
expressions. These transformations introduce new interneurons which
force the network to seek a saddle point rather than a minimum. Other
transformations allow control of the network dynamics, by reconciling
the Lagrangian formalism with the need for fixpoints. We apply the
transformations to simplify a number of structured neural networks,
beginning with the standard reduction of the winner-take-all network
from $O(N^2)$ connections to $O(N)$. Also susceptible are inexact
graph-matching, random dot matching, convolutions and coordinate
transformations, and sorting. Simulations show that fixpoint-preserving
transformations may be applied repeatedly and elaborately, and the
example networks still robustly converge.

Keywords: Objective function, Structured neural network, Analog circuit,
Transformation of objective, Fixpoint-preserving transformation,
Lagrangian dynamics, Graph-matching neural net, Winner-take-all neural
net.


1.  INTRODUCTION 

Objective functions have become important in the study of artificial
neural networks, for their ability to concisely describe a network and
the dynamics of its neurons and connections. For neurons with
objective-function dynamics, the now standard procedure (Hopfield, 1984;
Hopfield & Tank, 1985; Koch, Marroquin & Yuille, 1986) is to formulate
an objective function (called the "objective" in what follows) which
expresses the goal of the desired computation, then to derive a local
update rule (often a simple application of steepest descent) which will
optimize the dynamic variables, in this case the artificial analog
neurons. The update rule should ultimately converge to a fixpoint which
minimizes the objective and should be interpretable as the dynamics of a
circuit.

This procedure is direct but has drawbacks. For example it considers
only the goal of the computation and not the cost for attaining the goal
or the path taken in doing so. The resulting neural nets can be quite
expensive in their number of connections, and for some objectives the
associated local update rule has an algebraic form unsuitable for direct
implementation as a neural network.

In this paper we show how to modify the standard procedure by
interpolating an extra step: after the objective is formulated, it can
be algebraically manipulated in a way which preserves its meaning but
improves the resulting circuits (e.g., by decreasing some measure of
their cost). Then the improved circuit is derived from the modified
objective. Often it will be clear from the algebra alone that some
savings in number of connections will occur, or that a previously
nonimplementable objective has been transformed to a form (e.g.,
single-neuron potentials plus polynomial interactions (Hopfield, 1984))
whose minimization may be directly implemented as an analog circuit in a
given technology. And one interesting class of transformations can
establish detailed control over the state-space trajectory followed
during optimization.

We do not require that the algebraic manipulation be done automatically;
we just ask whether and how it can be done. Our answer is in the form of
a short, nonexhaustive list of very general transformations which can be
performed on summands of an objective, provided they are of the
requisite algebraic form, without altering the fixpoints of the
resulting network. Many of these transformations introduce a relatively
small number of new neurons which change the local minima of the
original objective into saddle points of the new one, and whose
dynamical behavior is to seek saddle points. It also seems likely that
many useful and general algebraic transformations await discovery.

Although we use the terminology appropriate to the optimization of
objectives for dynamic neurons with fixed connections, much of the
theory may apply also to "learning" considered as the optimization of
dynamic connections in an unstructured net (e.g., Rumelhart, Hinton &
Williams, 1986b) or of some smaller set of parameters which indirectly
determine the connections in a structured net (Mjolsness, Sharp &
Alpert, 1989b).

In the remainder of this section, we will introduce the ideas by
rederiving a well-known network simplification: that of the
winner-take-all (WTA) network from $O(N^2)$ to $O(N)$ connections.
Section 2 develops the theory of our algebraic transformations,
including the reduction of: squares and products of expressions; a broad
class of functions of one and two expressions; the minimum and maximum
of a set of expressions; and certain matrix forms which retain their
sparseness as they are reduced. Implications for circuit design are
discussed, and a major unsolved problem related to the handling of
sparse matrices is stated. Further algebraic transformations (section
2.7) allow control of the temporal aspects of the optimization process,
by modifying the usual Lagrangian formalism (which uses variational
calculus to derive time-reversible dynamics) to accommodate the need for
fixpoints in neural network dynamics. All the fixpoint-preserving
transformations are cataloged in section 2.8. In section 3, some of the
transformations are exercised in design examples. Experimental results
are available for a graph-matching network, a random-dot-matching
network, and an approximate sorting network which involves a series of
fixpoint-preserving transformations. For all these networks, approach to
a fixpoint is guaranteed if the minimizing neurons operate at a much
slower time scale than the maximizing neurons, but experimentally such
convergence is also observed when the two time scales are close; the
advantage of the latter mode of operation is that it requires much less
time for a network to converge. Finally, a discussion follows in section
4.


1.1.  Reversed Linear Neurons in the WTA Network

Consider the ordinary winner-take-all analog neural network. Following
Hopfield and Tank (1985), such a network can be obtained from the
objective

$$E_{\mathrm{wta}}(\mathbf{v}) = \frac{c_1}{2}\Big(\sum_i v_i - 1\Big)^2 + c_2 \sum_i h_i v_i + \sum_i \phi(v_i) \qquad (c_1 > 0), \tag{1}$$

where (Hopfield, 1984; Grossberg, 1988)

$$\phi(v_i) = \int^{v_i} dx\, g^{-1}(x), \tag{2}$$

using steepest-descent dynamics

$$\dot v_i = -\partial E/\partial v_i \tag{3}$$

or Hopfield-style dynamics

$$\dot u_i = -\partial E/\partial v_i, \qquad v_i = g(u_i). \tag{4}$$

The resulting connection matrix is

$$T_{ij} = -c_1,$$

which implies global connectivity among the neurons: if there are $N$
neurons, there are $N^2$ connections. It is well known that the
winner-take-all circuit requires only $O(N)$ connections if one
introduces a linear neuron $\sigma$ whose value is always $\sum_i v_i$.
It is not so well known that this can be done entirely within the
objective function, as follows:

$$E_{\mathrm{wta}}(\mathbf{v}, \sigma) = c_1\Big(\sum_i v_i - 1\Big)\sigma - \frac{c_1}{2}\sigma^2 + c_2 \sum_i h_i v_i + \sum_i \phi(v_i), \tag{5}$$

where the steepest-descent dynamics, for example, is modified to become

$$\dot v_i = -r_v\, \partial E/\partial v_i = -\partial E/\partial v_i \quad (\text{specialize to } r_v = 1), \tag{6}$$

$$\dot\sigma = +r_\sigma\, \partial E/\partial\sigma \quad (r_\sigma > 0)
 = c_1 r_\sigma \Big(-\sigma + \sum_i v_i - 1\Big)
 = 0 \ \text{ if } \ \sigma = \sum_i v_i - 1. \tag{7}$$

But if $\sigma = \sum_i v_i - 1$ then one can calculate that
$\partial E_{\mathrm{wta}}(\mathbf{v}, \sigma)/\partial v_i =
\partial E_{\mathrm{wta}}(\mathbf{v})/\partial v_i$; thus eqns (1) and
(5) have the same fixpoints. The connectivity implied by counting the
monomials in eqn (5) is $O(N)$ connections, the minimum possible for
this problem.
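The agreement of the two gradients can be checked numerically. The following is a minimal Python sketch, not part of the original software: it assumes a logistic transfer function (so that the potential is the integral of the logit), and the names `e_wta`, `e_wta_sigma`, `grad_v` and all parameter values are illustrative choices.

```python
import math

def phi(v):
    # Assumed potential: phi'(v) = g^{-1}(v) = log(v / (1 - v)) for a logistic g.
    return v * math.log(v) + (1.0 - v) * math.log(1.0 - v)

def e_wta(v, h, c1, c2):
    # Fully connected objective, eqn (1).
    s = sum(v) - 1.0
    return (0.5 * c1 * s * s
            + c2 * sum(hi * vi for hi, vi in zip(h, v))
            + sum(phi(vi) for vi in v))

def e_wta_sigma(v, sigma, h, c1, c2):
    # Reduced objective with the linear neuron sigma, eqn (5).
    s = sum(v) - 1.0
    return (c1 * s * sigma - 0.5 * c1 * sigma * sigma
            + c2 * sum(hi * vi for hi, vi in zip(h, v))
            + sum(phi(vi) for vi in v))

def grad_v(f, v, eps=1e-6):
    # Central-difference gradient with respect to the v_i only.
    g = []
    for i in range(len(v)):
        up = list(v)
        up[i] += eps
        dn = list(v)
        dn[i] -= eps
        g.append((f(up) - f(dn)) / (2.0 * eps))
    return g

v = [0.2, 0.7, 0.4]
h = [0.3, -0.5, 0.1]
c1, c2 = 2.0, 1.0
sigma_star = sum(v) - 1.0                      # value at which eqn (7) vanishes
dE_dsigma = c1 * (sum(v) - 1.0 - sigma_star)   # analytic dE/dsigma from eqn (5)

g_full = grad_v(lambda u: e_wta(u, h, c1, c2), v)
g_reduced = grad_v(lambda u: e_wta_sigma(u, sigma_star, h, c1, c2), v)
max_gap = max(abs(a - b) for a, b in zip(g_full, g_reduced))
print(max_gap)
```

In the reduced objective, sigma is held fixed while each v_i is perturbed, which is what the partial derivative in eqn (6) means; the squared sum in eqn (1) couples every pair of neurons, whereas eqn (5) couples each v_i only to sigma.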

Note that the $\sigma$ linear neuron actually behaves so as to increase
the objective $E(\mathbf{v}, \sigma)$, while the $v_i$ neurons act to
decrease it; $\sigma$ may be called a reversed neuron. Reversed neurons
introduce a new element of competition into a network; indeed,
two-person zero-sum games are usually modelled using objectives which
one player increases and the other decreases (von Neumann and
Morgenstern, 1953). So in this network, and in the others we will
introduce, minimization is replaced with finding a saddle point and the
problem becomes hyperbolic. This follows immediately from the sign of
$\sigma^2$ in (5), which eliminates all local minima. Fortunately there
are hyperbolic versions of such efficient optimization procedures as the
conjugate gradient method; Luenberger (1984) gives two examples.

For finite $r_\sigma$, $\sigma$ is a delayed version of the sum
$\sum_i v_i - 1$ and although the network dynamics are different from
eqn (3), the fixed point is the same. Alternatively, the rate parameter
$r_\sigma$ may be adjusted to make $\sigma$ move at a different time
scale from the rest of the neurons. As $r_\sigma$ approaches infinity,
$\sigma$ becomes an infinitely fast neuron whose value is always

$$\sigma = \sum_i v_i - 1$$

and the other neurons see an effective objective

$$E_{\mathrm{wta}}(\mathbf{v}, \sigma(\mathbf{v})) = E_{\mathrm{wta}}(\mathbf{v}) \tag{8}$$

so that their dynamics become identical to that of the original fully
connected winner-take-all network, which is guaranteed to approach a
fixpoint since $dE_{\mathrm{wta}}/dt \le 0$ and $E_{\mathrm{wta}}$ is
bounded below. There are very efficient serial and parallel
implementations for networks with $r_\sigma \to \infty$, which update
the infinitely fast neuron whenever its ordinary neighbors change; this
is a standard trick in neural network and Monte Carlo physics
simulations. When it is applied to simplify the simulation of the WTA
equations of motion, we arrive at the standard WTA trick.
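The incremental update of an infinitely fast neuron can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the authors' simulator: it uses a logistic transfer function, serial Euler steps of the Hopfield-style dynamics of eqn (4), and hypothetical names (`run_wta`, `incremental`) and parameter values. With `incremental=True` the sum is corrected in O(1) work whenever one neuron changes; with `incremental=False` it is recomputed from scratch, as a fully connected simulation would do.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def run_wta(h, c1, c2, dt, sweeps, incremental):
    n = len(h)
    u = [0.0] * n
    v = [logistic(x) for x in u]
    sigma = sum(v) - 1.0                      # infinitely fast neuron: always sum(v) - 1
    for _ in range(sweeps):
        for i in range(n):
            if not incremental:
                sigma = sum(v) - 1.0          # O(N) work per neuron update
            # Euler step of eqn (4); the potential contributes phi'(v_i) = u_i.
            u[i] += dt * -(c1 * sigma + c2 * h[i] + u[i])
            new_v = logistic(u[i])
            if incremental:
                sigma += new_v - v[i]         # O(1) correction when one neighbor changes
            v[i] = new_v
    if not incremental:
        sigma = sum(v) - 1.0
    return v, sigma

h = [0.3, -0.5, 0.1, 0.2]
v_full, sigma_full = run_wta(h, 1.0, 1.0, 0.05, 400, incremental=False)
v_fast, sigma_fast = run_wta(h, 1.0, 1.0, 0.05, 400, incremental=True)
max_diff = max(abs(a - b) for a, b in zip(v_full, v_fast))
print(max_diff, v_fast)
```

Both runs follow the same trajectory up to floating-point rounding, which is the sense in which the fast neuron leaves the dynamics of the other neurons unchanged.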

Moody (1989) independently discovered an objective function equivalent
to (5) and first simulated its delayed Hopfield-style equations of
motion (see (4)). But he remained unaware that the $\sigma$ neuron acts
to maximize rather than minimize the objective, driving the system to a
saddle point. Platt and Barr (1988, 1987) performed constrained
optimization (including multiple WTA constraints) by using a subset of
neurons that explicitly increase the objective, and hence a network that
seeks saddle points rather than local minima. Indeed reversed neurons
are a generalization of their analog Lagrange multiplier neurons, which
were found earlier in a non-neural context by Arrow (1958).

Both reversed neurons and Lagrange multiplier neurons act to maximize an
objective which other neurons act to minimize. The difference is that
Lagrange multiplier neurons must appear linearly in the objective.
General reversed neurons can have self-interactions; in particular, the
potential of a linear reversed neuron like $\sigma$ in the
winner-take-all network is

$$\phi(\sigma) = \int^{\sigma} dx\, g^{-1}(x), \qquad \text{where } g(x) = -g_0 x, \tag{9}$$

that is, linear reversed neurons have linear transfer functions with
negative gain. Thus in circuit language they are just inverters, which
happen to occur in a network with an objective function and to act so as
to increase $E$. Lagrange multiplier neurons, on the other hand, have no
potential term and are not inverters. It is also worth noting that
Lagrange multiplier neurons work best in conjunction with an additional
penalty term $h(\mathbf{v})^2/2$ (where $h(\mathbf{v}) = 0$ is the
constraint) and the penalty term can be efficiently implemented using
one reversed neuron per constraint, as we will see.

If in addition to being reversed, a neuron is also infinitely fast, then
it may be necessary to restrict its connectivity in order to efficiently
simulate or implement the network. One possible design rule is to
entirely prohibit connections between infinitely fast neurons; this
prevents one from having to solve a system of linear equations in order
to update a set of infinitely fast neurons. We will not generally assume
that reversed neurons are infinitely fast.


2.  THEORY 

The reversed linear neuron is applicable in many circumstances beyond
the winner-take-all network. We can begin to see its generality by
considering objectives of the form

$$E(\mathbf{v}) = E_0(\mathbf{v}) + (c/2)\, X^2(\mathbf{v}),$$

where $E_0$ and $X$ are any algebraic expressions, and $c$ is a constant
of either sign. This may be transformed to

$$\hat{E}(\mathbf{v}, \sigma) = E_0 + cX\sigma - (c/2)\,\sigma^2$$

and if $X$ is a polynomial, this represents a reduction in the number
and order of the monomials that occur in $E$. The transformation
technique used here is simple to state: find squared expressions
$(c/2)X^2$ as summands in the objective function, and replace them with
$cX\sigma - (c/2)\sigma^2$. Here $c$ must be a constant and $\sigma$ is
a new linear interneuron, reversed if $c$ is positive. Most, but not
all, of our reversed linear interneurons will be introduced this way.
The transformation may be abbreviated as

$$\tfrac{1}{2}X^2 \longrightarrow X\sigma - \tfrac{1}{2}\sigma^2. \tag{10}$$

We employ the steepest-ascent-descent dynamics

$$\dot v_i = -\partial \hat{E}/\partial v_i = -\partial E_0/\partial v_i - c\,\sigma\,\partial X/\partial v_i,$$
$$\dot\sigma = +(r_\sigma/c)\,\partial \hat{E}/\partial\sigma = r_\sigma (X - \sigma),$$

which, at a fixpoint, has $X = \sigma$ and
$\partial E_0/\partial v_i + cX\,\partial X/\partial v_i =
\partial E/\partial v_i = 0$. Likewise a fixpoint of $E$ can be
extended, by setting $\sigma = X$, to a fixpoint of $\hat{E}$. So
fixpoints are preserved by the transformation (10). The argument also
works if some of the $v_i$ are already reversed neurons, so that
$\dot v_i = \pm\,\partial E/\partial v_i$.
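A small numerical experiment illustrates the fixpoint argument. The sketch below is a hypothetical example rather than anything taken from the report: it assumes $E_0 = \tfrac12 (v_1 - 3)^2 + \tfrac12 v_2^2$, $X = v_1 + v_2$ and $c = 2$, for which the minimum of $E = E_0 + (c/2)X^2$ is at $v = (1.8, -1.2)$, and integrates the steepest-ascent-descent dynamics with a plain Euler scheme.

```python
def simulate_square_transform(c=2.0, r_sigma=1.0, dt=0.01, steps=3000):
    # Assumed example: E0 = 0.5*(v1 - 3)^2 + 0.5*v2^2 and X = v1 + v2.
    v1, v2, sigma = 0.0, 0.0, 0.0
    for _ in range(steps):
        x = v1 + v2
        # Descent on the ordinary neurons: -dE0/dv_i - c*sigma*dX/dv_i.
        dv1 = -((v1 - 3.0) + c * sigma)
        dv2 = -(v2 + c * sigma)
        # Ascent on the reversed linear interneuron: r_sigma * (X - sigma).
        dsigma = r_sigma * (x - sigma)
        v1 += dt * dv1
        v2 += dt * dv2
        sigma += dt * dsigma
    return v1, v2, sigma

v1, v2, sigma = simulate_square_transform()
print(v1, v2, sigma)
```

At convergence sigma equals X, and the ordinary neurons sit at the minimizer of the untransformed objective, even though the transformed objective has a saddle point there rather than a minimum.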

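As an example of the transformation (10), one can robustly implement a
constraint $h(\mathbf{v}) = 0$ using both a penalty term $ch^2/2$ and a
Lagrange multiplier neuron $\lambda$. The objective becomes

$$E_{\mathrm{constraint}}(\mathbf{v}, \sigma, \lambda) = c\,\sigma\,h(\mathbf{v}) - c\,\sigma^2/2 + \lambda\,h(\mathbf{v}). \tag{11}$$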
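The combination of penalty and multiplier can be tried on a toy problem. The following Python sketch makes several assumptions that are not in the text: an added quadratic cost $E_0 = \tfrac12(v_1^2 + v_2^2)$, the constraint $h(\mathbf{v}) = v_1 + v_2 - 1$, unit rate constants, and Euler integration; `solve_constrained` is a hypothetical name. The ordinary neurons descend, while the reversed neuron sigma and the multiplier lambda both ascend.

```python
def solve_constrained(c=1.0, r_sigma=1.0, dt=0.01, steps=5000):
    # Assumed problem: minimize 0.5*(v1^2 + v2^2) subject to h = v1 + v2 - 1 = 0,
    # using E = E0 + c*sigma*h - c*sigma^2/2 + lam*h (constraint terms as in eqn (11)).
    v = [0.0, 0.0]
    sigma, lam = 0.0, 0.0
    for _ in range(steps):
        h = v[0] + v[1] - 1.0
        # dE/dv_i = v_i + (c*sigma + lam) * dh/dv_i, with dh/dv_i = 1.
        dv = [-(vi + c * sigma + lam) for vi in v]
        dsigma = r_sigma * (h - sigma)     # (r_sigma/c) * dE/dsigma
        dlam = h                           # ascent on the multiplier: +dE/dlam
        v = [vi + dt * dvi for vi, dvi in zip(v, dv)]
        sigma += dt * dsigma
        lam += dt * dlam
    return v, sigma, lam

v, sigma, lam = solve_constrained()
print(v, sigma, lam)
```

For this problem the constrained minimum is $v = (0.5, 0.5)$ with multiplier $-0.5$, and the reversed neuron relaxes to the constraint violation, which is zero at the solution.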


2.1.  Products  and  Order  Reduction 

From the transformation (10) we may deduce two others, which are
applicable whenever a summand of an objective is a product $XY$ of
expressions $X$ and $Y$. Such products are common and can be expensive;
for example if $X$ and $Y$ are each sums of $N$ variables, then
expanding their product out into monomial interactions gives $N^2$
connections. But only $O(N)$ connections are needed if one uses the
transformation

$$\begin{aligned}
XY &= \tfrac{1}{2}(X + Y)^2 - \tfrac{1}{2}X^2 - \tfrac{1}{2}Y^2 \\
 &\longrightarrow (X + Y)\sigma - X\tau - Y\omega - \tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\tau^2 + \tfrac{1}{2}\omega^2 \\
 &\longrightarrow X(\sigma - \tau) + Y(\sigma - \omega) - \tfrac{1}{2}\sigma^2 + \tfrac{1}{2}\tau^2 + \tfrac{1}{2}\omega^2.
\end{aligned} \tag{12}$$

Here $\sigma$ is a reversed linear neuron, and $\tau$ and $\omega$ are
ordinary linear neurons. If all three linear interneurons are infinitely
fast, which is easy to simulate since they are not directly connected,
then the transformation does not change the dynamics of the rest of the
variables in the network. Otherwise, the dynamics and the basins of
attraction change, but the network fixed points remain the same.

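This transformation may be simplified,¹ to

$$\begin{aligned}
XY &= \tfrac{1}{4}\big[(X + Y)^2 - (X - Y)^2\big] \\
 &\longrightarrow \tfrac{1}{2}(X + Y)\sigma - \tfrac{1}{2}(X - Y)\tau - \tfrac{1}{4}\sigma^2 + \tfrac{1}{4}\tau^2 \\
 &= \tfrac{1}{2}X(\sigma - \tau) + \tfrac{1}{2}Y(\sigma + \tau) - \tfrac{1}{4}\sigma^2 + \tfrac{1}{4}\tau^2.
\end{aligned} \tag{13}$$

Compared to eqn (12), this transformation results in the same number of
monomial interactions and one less neuron, which may be useful on
occasion.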
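Both product transformations can be verified by evaluating the transformed expressions at the equilibrium values of the interneurons. The Python sketch below is illustrative only (the function names and test values are assumptions): for eqn (12) the equilibrium is $\sigma = X + Y$, $\tau = X$, $\omega = Y$, and for eqn (13) it is $\sigma = X + Y$, $\tau = X - Y$.

```python
def product_three(x, y, sigma, tau, omega):
    # Right-hand side of eqn (12): three linear interneurons.
    return (x * (sigma - tau) + y * (sigma - omega)
            - 0.5 * sigma ** 2 + 0.5 * tau ** 2 + 0.5 * omega ** 2)

def product_two(x, y, sigma, tau):
    # Right-hand side of eqn (13): two linear interneurons.
    return (0.5 * x * (sigma - tau) + 0.5 * y * (sigma + tau)
            - 0.25 * sigma ** 2 + 0.25 * tau ** 2)

x, y = 1.7, -0.6
val3 = product_three(x, y, x + y, x, y)
val2 = product_two(x, y, x + y, x - y)

# Saddle structure of eqn (13): moving the reversed neuron sigma away from
# equilibrium lowers the value, moving the ordinary neuron tau raises it.
d = 0.1
lower = product_two(x, y, x + y + d, x - y)
higher = product_two(x, y, x + y, x - y + d)
print(val3, val2, x * y)
```

The last two evaluations show why the interneurons must be driven in opposite directions: sigma sits at a maximum of the transformed expression and tau at a minimum.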

Reversed neurons allow one to transform a high-order polynomial
objective, monomial by monomial, into a third-order objective. Similar
transformations on the neural networks or analog circuits are well
known. But it is easier to do theoretical work with the objective, and
by transforming the objective first, and then translating to a neural
net, one can obtain novel third-order neural nets.

¹As pointed out to us by P. Anandan.

We may expand a high-order polynomial objective into monomials, each of
which corresponds to one connection or "synapse" of the associated
neural network. We may reduce the order of an entire objective by
reducing the order of each monomial. Consider, then, a single
fourth-order monomial:

$$E_{\mathrm{mono}}(x, y, z, w) = -T\,xyzw, \tag{14}$$

which by (12) may be transformed to

$$\hat{E}_{\mathrm{mono}}(x, y, z, w, \sigma, \tau, \omega) = T\big\{ xy(\tau - \sigma) + zw(\omega - \sigma) + \tfrac{1}{2}\sigma^2 - \tfrac{1}{2}\tau^2 - \tfrac{1}{2}\omega^2 \big\}. \tag{15}$$

Here $\sigma$, $\tau$, and $\omega$ are linear neurons with gain $1/T$.
The order-reducing transformation is illustrated in network language in
Figure 1.
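The fourth-order to third-order reduction can be checked in the same way. This Python sketch is an illustration under assumed numerical values, with hypothetical function names: it evaluates the transformed monomial at the interneuron equilibrium $\tau = xy$, $\omega = zw$, $\sigma = xy + zw$, and compares the net input seen by neuron $w$ with that of the original monomial $-Txyzw$.

```python
def e_mono(x, y, z, w, T):
    # Original fourth-order monomial.
    return -T * x * y * z * w

def e_mono_reduced(x, y, z, w, sigma, tau, omega, T):
    # Third-order form obtained by applying the product transformation to (xy)(zw).
    return T * (x * y * (tau - sigma) + z * w * (omega - sigma)
                + 0.5 * sigma ** 2 - 0.5 * tau ** 2 - 0.5 * omega ** 2)

x, y, z, w, T = 0.9, -1.3, 0.4, 2.0, 1.5
tau, omega = x * y, z * w
sigma = x * y + z * w

original = e_mono(x, y, z, w, T)
reduced = e_mono_reduced(x, y, z, w, sigma, tau, omega, T)

# Derivative with respect to w, holding the interneurons at their equilibrium values.
eps = 1e-6
dw_reduced = (e_mono_reduced(x, y, z, w + eps, sigma, tau, omega, T)
              - e_mono_reduced(x, y, z, w - eps, sigma, tau, omega, T)) / (2 * eps)
dw_original = -T * x * y * z
print(original, reduced, dw_reduced, dw_original)
```

Every term of the reduced expression involves at most three variables, which is what makes it a third-order objective.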

The same technique may be used to recursively transform a monomial of
any order $m$ to a sum of third-order monomials, plus potentials for the
new linear interneurons. The resulting number of new monomials and
interneurons is $O(m(\log m)^2)$ if the reduction is done in a balanced
way and if expressions like $X(\sigma - \tau)$ are not expanded out to
$X\sigma - X\tau$ during the reduction.

Another order-reduction transformation, superior in some circumstances,
will be developed in section 2.4.

FIGURE 1. Order reduction. Arbitrary fourth- to third-order reduction,
using linear interneurons. Open circles: original neurons. Open squares:
ordinary linear interneurons. Closed squares: reversed linear
interneurons. Dots: connections, with strengths as indicated.
Equilibrium values of the interneurons are indicated. After the
transformation, neuron $w$ receives input $xyz + z^2 w$ from the left
and a compensating input of $-z^2 w$ from the right.

2.2.  Reducing F(X)

Often an objective function includes a fairly general nonlinear function
$F$ of an entire expression $X$. This may be much more difficult and
costly to implement directly than either a low-order polynomial or a
single-neuron potential function $\phi_i(v_i)$, because the $F(X)$
nonlinearity involves an interaction of many variables. But for some
functions $F$, such an algebraic form is still neurally implementable by
means of transformations:

$$\int^{X} f(u)\, du \longrightarrow X\sigma - \int^{\sigma} f^{-1}(u)\, du \qquad (f \text{ invertible}). \tag{16}$$

Note  that  X  no  longer  appears  inside  the  function 
F  =  /  /.  The  validity  of  this  transformation  (eqn 
(16))  may  be  proven  by  noting  that  the  optimal  value 
of  a  is  f(X),  and  then  either  integrating  by  parts  the 
expression 

cx 

f-'(u)  du  =  vf{v)  du. 

or  else  differentiating  both  pre-  and  post-transfor¬ 
mation  objectives  with  respect  to  X. 

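In this way one can treat functions $\exp X$, $|X| \log |X|$, $\log |X|$, and $|X|^p$ of arbitrary algebraic expressions $X$:

\[ \begin{aligned}
e^{X} &\to (X + 1)\sigma - \sigma \log \sigma, \\
|X| \log |X| &\to |X|(\sigma + 1) - e^{\sigma}, \\
\log |X| &\to X\sigma - \log |\sigma|, \\
\frac{1}{1 + p} |X|^{1 + p} &\to X\sigma - \frac{1}{1 + 1/p} |\sigma|^{1 + 1/p} \quad (p \neq -1, 0).
\end{aligned} \qquad (17) \]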


Thus,  neural  nets  may  be  constructed  from  some 
highly  nonpolynomial  objectives. 
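The entries of eqn (17) can be spot-checked by evaluating each right-hand side at its stationary interneuron value $\sigma = f(X)$. This is a minimal sketch with illustrative names and test values; the $|X| \log |X|$ entry is taken here as $|X|(\sigma + 1) - e^{\sigma}$, which is the form that reproduces the left-hand side, and the $\log|X|$ entry is omitted because it agrees only up to an additive constant.

```python
import math

def exp_form(X, sigma):
    """Right-hand side for e^X: (X + 1) sigma - sigma log sigma."""
    return (X + 1.0) * sigma - sigma * math.log(sigma)

def xlogx_form(X, sigma):
    """Right-hand side for |X| log|X|: |X| (sigma + 1) - e^sigma (assumed sign)."""
    return abs(X) * (sigma + 1.0) - math.exp(sigma)

def power_form(X, sigma, p):
    """Right-hand side for |X|^(1+p)/(1+p): X sigma - |sigma|^(1+1/p)/(1+1/p)."""
    q = 1.0 + 1.0 / p
    return X * sigma - abs(sigma) ** q / q

X, p = 1.7, 3.0
checks = {
    # name: (transformed objective at the stationary sigma, original F(X))
    "exp": (exp_form(X, math.exp(X)), math.exp(X)),
    "xlogx": (xlogx_form(X, math.log(abs(X))), abs(X) * math.log(abs(X))),
    "power": (power_form(X, math.copysign(abs(X) ** p, X), p),
              abs(X) ** (1.0 + p) / (1.0 + p)),
}
max_gap = max(abs(transformed - original)
              for transformed, original in checks.values())
```

In each case the stationary value of $\sigma$ is the derivative $f(X) = F'(X)$, as stated for eqn (16).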

The interneurons may still be reversed, but are no longer linear, in this kind of transformation. The potential $\phi(\sigma)$ permits a possible efficiency technique. The dynamics of eqn (4) is expected to be more efficient than eqn (3), since it may be viewed as a quasi-Newton method which takes into account the potential but not the interaction part of a neural net objective (as shown by J. Utans, 1989). A related update scheme for the reversed interneuron $\sigma$ is

\[ \sigma = f(s), \quad \dot{s} = r_\sigma\,\partial E / \partial \sigma = r_\sigma (X - s), \qquad v_i = g_i(u_i), \quad \dot{u}_i = -r_i\,\partial E(v, \sigma) / \partial v_i, \qquad (18) \]

which is an alternative to direct steepest-ascent-descent. This dynamics has the distinct advantage of a simple interpretation in terms of analog electrical circuits (Hopfield, 1984). For example, $F(X) = X(\log X - 1)$ requires a special neuron whose transfer function is logarithmic. This can be provided, approximately and within a mildly restricted domain of the input values, in analog VLSI (Sivilotti, Mahowald & Mead, 1987). Similarly $F(X) = \log X$ would require a transfer function $f(s) = 1/s$, and an exponential transfer function would lead to $F(X) = \exp X$. It might be possible to characterize a particular technology by a list of the basic forms of objectives it makes available, with their respective costs and restrictions, and to compile general networks into the desired forms by using a catalog of algebraic transformations.

Equation (16) generally may be used to transform a term $F(X)$ in an objective by transferring the nonlinearity due to $F$ from $F(X)$ to the single-neuron potential $\phi(\sigma) = \int^{\sigma} f^{-1}(u)\,du$. By transferring the nonlinearity from an interaction term (i.e., a summand of the objective which involves several dynamical variables) to a single-neuron potential term, one can not only decrease the cost of implementing a network which uses gradient methods for optimization, but one can also transform unimplementable objectives into implementable ones. For example, one might regard the class of multivariable polynomials as the "implementable" interactions in a certain technology (Rumelhart, Hinton & McClelland, 1986a). (In this case the word "interaction" is usually reserved for a multivariable monomial, out of which polynomials are built by addition of objectives.) Then one might use eqn (16) to reduce other, far more general interaction objectives to the implementable form.

Of course the required potential $\phi(\sigma)$ may itself be "unimplementable", but approximating its gradient with a small circuit is likely to be far more tractable than approximating $\nabla F(X)$ because $\phi$, unlike $F$, is a function of just one dynamic variable. An approximation of $\phi'(\sigma)$ might be formulated as

\[ \phi(\sigma) \;\to\; \sum_\alpha c_\alpha \phi_\alpha(\sigma), \]

where each $\phi_\alpha(\sigma)$ is regarded as an implementable self-interaction and $c_\alpha$ are adjustable coefficients.


2.3. Reducing F(X): Examples

We exhibit two examples of the objective function transformation of eqn (16), with dynamics (18). First consider the toy problem of optimizing

\[ E(x) = e^{x} + e^{-x}. \]


One of the exponentials may be taken to be the potential of a neuron whose transfer function is exponential; such a transfer function is implementable in many technologies and, like the logarithmic transfer function, might be part of a standard component library for analog neural nets. Adding the other exponential would however require a special modification to this transfer function, and their sum might not be in the standard component library. So let us move the second exponential nonlinearity to another neuron whose transfer function is logarithmic:

\[ E(x, \sigma) = e^{x} - x\sigma - \sigma \log \sigma + \sigma, \]

whence the dynamics (18) become

\[ \sigma = e^{s}, \quad \dot{s} = r_\sigma(-x - s), \qquad x = \log u, \quad \dot{u} = r_x(\sigma - u). \qquad (19) \]

The evolution of this two-neuron network is shown in Figure 2, along with a contour map of the saddle-shaped objective $E(x, \sigma)$. Note that for quick descent, $r_\sigma \approx r_x$ is preferred. Despite the saddle point, and despite the potential numerical sensitivity of exponential and logarithmic transfer functions, the network functions well.
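The two-neuron dynamics of eqn (19) can be sketched with a forward Euler loop. This is an illustrative simulation, assuming the reading $\sigma = e^{s}$, $\dot s = r_\sigma(-x - s)$, $x = \log u$, $\dot u = r_x(\sigma - u)$; the step size, initial condition, and function name are assumptions and are not the settings used for Figure 2.

```python
import math

def simulate(r_sigma=1.0, r_x=1.0, dt=0.01, steps=5000, x0=1.0, s0=0.0):
    """Forward Euler integration of the assumed form of eqn (19)."""
    u, s = math.exp(x0), s0            # internal states: x = log u, sigma = exp s
    for _ in range(steps):
        x, sigma = math.log(u), math.exp(s)
        s += dt * r_sigma * (-x - s)   # reversed (exponential) interneuron
        u += dt * r_x * (sigma - u)    # logarithmic neuron
    return math.log(u), math.exp(s)

x_final, sigma_final = simulate()

# Value of the transformed objective at the point reached.
saddle_value = (math.exp(x_final) - x_final * sigma_final
                - sigma_final * math.log(sigma_final) + sigma_final)
```

With equal rates the trajectory spirals into $x = 0$, $\sigma = 1$, where the transformed objective equals 2, the minimum of $e^{x} + e^{-x}$.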

A second example is the linear programming network of (Tank & Hopfield, 1986), which can also be interpreted as an application of transformation (16) with $r_\sigma \to \infty$ in the dynamics of eqn (18). The linear programming problem is to minimize $A \cdot v$ subject to a set of constraints $D_j \cdot v \ge B_j$. Their objective is

\[ E[v] = \sum_i A_i v_i + \sum_j F\Big( \sum_i D_{ji} v_i - B_j \Big) + \sum_i v_i^2 / 2g_0 \quad (\text{large } g_0), \]

where $dF(x)/dx = f(x) = -\max(0, -x)$ penalizes violations of the inequality constraints and proved to be electronically implementable. Transforming according to (16) we get

\[ E[v, \sigma] = \sum_i A_i v_i + \sum_j \sigma_j \Big( \sum_i D_{ji} v_i - B_j \Big) - \sum_j \int^{\sigma_j} f^{-1}(s)\,ds + \sum_i v_i^2 / 2g_0 \]

and equation of motion

\[ \sigma_j = f(s_j), \quad \dot{s}_j = r_\sigma \Big( \sum_i D_{ji} v_i - B_j - s_j \Big), \qquad v_i = g_0 u_i, \quad \dot{u}_i = r_v \Big( -u_i - A_i - \sum_j D_{ji} \sigma_j \Big). \qquad (20) \]

This network approaches a saddle point rather than a minimum, but the $r_\sigma \to \infty$ version,

\[ \dot{u}_i / r_v = -u_i - A_i - \sum_j D_{ji}\, f\Big( \sum_k D_{jk} v_k - B_j \Big), \]

is exactly the network dynamics of eqn (17) of (Tank & Hopfield, 1986).

FIGURE 2. Exponential and logarithmic neurons. The minimum of $e^{x} + e^{-x}$ occurs at the saddle point of $e^{x} - x\sigma - \sigma \log \sigma + \sigma$, whose contours are plotted here. Also various two-neuron trajectories to the saddle point are shown, in which $x$ or $\sigma$ moves more slowly than the fastest implementable time scale, assumed to be $r = 1$. (a) $r_\sigma = 1$, $r_x = .1$, (b) $r_\sigma = 1$, $r_x = .3$, (c) $r_\sigma = 1$, $r_x = 1$, (d) $r_\sigma = .3$, $r_x = 1$, (e) $r_\sigma = .1$, $r_x = 1$. Dots occur every 10 time constants, so (c) gives quickest convergence.

2.4. Interacting Expressions: Reducing G(X, Y)

Until now we have attempted to reduce all interactions to the forms $xy$ and $xyz$, but those may not be the only cost-effective few-variable interactions allowed by a given physical technology. If others are allowed, then there are one-step transformations for a class of functions of two arbitrary expressions $G(X, Y)$. In this way the set of objectives with known low-cost neural implementations can be expanded to include common algebraic forms such as $X/Y$ and $XF(Y)$.

From eqn (16) we may derive a generalization to functions of two expressions, $G(X, Y)$:


(which implies (22)) and

\[ Y \int^{X} f(u)\,du \;\to\; XY\sigma - Y \int^{\sigma} f^{-1}(u)\,du \qquad (24) \]

assuming $f = F'$ is invertible and $f^{-1} = (F')^{-1}$ is differentiable.

Monomial order reduction can sometimes be accomplished more cheaply using $x \log y$ interactions than third-order ones. If the $v_i$ are all restricted to be positive, then

\[ \prod_{i=1}^{m} v_i = \exp \sum_i \log v_i \;\to\; \sigma \Big( \sum_i \log v_i + 1 \Big) - \sigma \log \sigma = \sum_i \sigma \log v_i + \sigma(1 - \log \sigma). \qquad (25) \]

This objective has $O(m)$ interactions of the new type. The fixpoint value of $\sigma$ is $\prod_i v_i$, at which point the steepest-descent input to $v_i$ is $-\sigma / v_i = -\prod_{j \ne i} v_j$. A product of expressions could be further reduced using eqn (23) and $x e^{y}$ interactions:


du 


j  du  g(u.  v 
Xa  -  j‘  du  jY 


(u)  (note  function  inverse] 
du  g{u,  u)  [by  (16)] 


* a 

du  g(u,  u) 

. 


~  Xa  -  J  dv 

- »  Xa  -  Yx  +  ]  du 

[by  (16)]. 

Thus, 


du  g(u ,  u) 


(«) 


(y).  (2i) 

Of  course  the  inverse  functions  must  exist  for  this 
transformation  to  be  valid,  and  this  restricts  G. 
Taking 

g{u,  u)  =  1/2 Vuo 

we can derive the transformation $-Y/X \to X\sigma - Y\tau - \sigma/\tau$. Rescaling $Y$ and $\sigma$ by $-1$, then switching $X$ for $Y$ and $\sigma$ for $\tau$, this is equivalent to

\[ X/Y \;\to\; X\sigma - Y\tau + \tau/\sigma, \qquad (22) \]

which is linear in $\tau$ but effectively nonlinear due to the optimization of $\sigma$.
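Transformation (22) can be verified directly at its stationary point. The sketch below assumes the reading $X/Y \to X\sigma - Y\tau + \tau/\sigma$, for which the stationary interneuron values are $\sigma = 1/Y$ and $\tau = X/Y^2$; the names and test values are illustrative.

```python
def ratio_form(X, Y, sigma, tau):
    """Assumed right-hand side of eqn (22): X sigma - Y tau + tau / sigma."""
    return X * sigma - Y * tau + tau / sigma

def interneuron_gradient(X, Y, sigma, tau):
    """Partial derivatives of ratio_form with respect to sigma and tau."""
    return X - tau / sigma ** 2, -Y + 1.0 / sigma

X, Y = 2.5, 0.8
sigma_star, tau_star = 1.0 / Y, X / Y ** 2

d_sigma, d_tau = interneuron_gradient(X, Y, sigma_star, tau_star)
value_gap = abs(ratio_form(X, Y, sigma_star, tau_star) - X / Y)
```

Both interneuron derivatives vanish there and the objective takes the value $X/Y$, so the interaction has been replaced by terms that are each linear in $X$ or $Y$.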

From eqn (21), Appendix A derives two transformations for the special form $YF(X)$:

\[ YF(X) \;\to\; -X\sigma + Y\tau + \sigma F^{-1}(\tau) \qquad (23) \]


du 


■y 

du  g(u,  u) 


(u) - ►  Xa  -  Yx 

J  du  j  du  g(u,  u) 


\[ \prod_i v_i \;\to\; \sum_i \big( \sigma\tau_i - v_i\omega_i + \omega_i e^{\tau_i} \big) + \sigma(1 - \log \sigma). \qquad (26) \]

It has been pointed out to us (Simic, 1989) that the transformations for $F(X)$ and $G(X, Y)$, and hence all the transformations discussed so far, can be interpreted as Legendre transformations (Courant & Hilbert, 1962).


2.5.  Min  and  Max 

The minimum or maximum of a set of expressions $\{X_\alpha\}$ can be implemented using a winner-take-all network in which each expression is represented by a neuron $\sigma_\alpha$ which competes with the others and is constrained to lie between 0 and 1. Indeed, $\sum_\alpha X_\alpha \sigma_\alpha$ attains the value $\min_\alpha X_\alpha$ when the correct representative wins, and also provides inhibition to the representatives in proportion to the values of the expressions they represent, so that the correct representative will win. The potential $\phi(\sigma)$ that occurs in the WTA network must have only one minimum, so that there is no hysteresis in the circuit, and must closely approximate a square well (i.e., must have high gain). Under these circumstances, we can transform

\[ E = \min_\alpha X_\alpha \;\to\; \sum_\alpha X_\alpha \sigma_\alpha + C \Big( \sum_\alpha \sigma_\alpha - 1 \Big) \lambda - \frac{C}{2} \lambda^2 + \sum_\alpha \phi(\sigma_\alpha) \qquad (27) \]

(where $C$ and the gain of $\phi$ are sufficiently large), and at any fixpoint of $\sigma$ the derivatives of $E$ with respect to all other dynamical variables will be preserved. So, fixpoints will be preserved.

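Likewise

\[ \max_\alpha X_\alpha \;\to\; \sum_\alpha X_\alpha \sigma_\alpha - C \Big( \sum_\alpha \sigma_\alpha - 1 \Big) \lambda + \frac{C}{2} \lambda^2 - \sum_\alpha \phi(\sigma_\alpha). \qquad (28) \]

An alternate transformation for max proceeds through the identity

\[ \max_\alpha X_\alpha = \lim_{p \to \infty} \Big( \sum_\alpha X_\alpha^{p} \Big)^{1/p} \quad (X_\alpha > 0). \qquad (29) \]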

These transformations have application in the design of some standard "content addressable memories" for which the ideal objective, which must be translated to a form polynomial in its interactions, may be taken as

\[ E_{\text{CAM}}(v) = -\max_m\, v \cdot v^{m} - \epsilon\, v \cdot h_{\text{input}} + \sum_i \phi_{\pm 1}(v_i) \]

with $-1 \le v_i \le 1$. Using the transformation (28) yields a CAM with one "grandmother neuron" (representative neuron) per memory, and a WTA net among them (closely related to a network described in Moody (1989)):

\[ E_{\text{CAM}}(v, \sigma) = -\sum_m (v \cdot v^{m})\, \sigma_m - \epsilon\, v \cdot h_{\text{input}} + C \Big( \sum_m \sigma_m - 1 \Big) \lambda - \frac{C}{2} \lambda^2 + \sum_m \phi_{0/1}(\sigma_m) + \sum_i \phi_{\pm 1}(v_i). \]

Another efficient CAM design (Gindi, Gmitro & Parthasarathy, 1987; Moody, 1989) may be derived by applying transformation (29) to the max expression:

\[ \max_m |v \cdot v^{m}| \approx \Big( \sum_m |v \cdot v^{m}|^{p} \Big)^{1/p} \quad (p \text{ large}), \]

which we replace by a monotonic function thereof, $(1/p) \sum_m |v \cdot v^{m}|^{p}$, possibly adjusting $\epsilon$ and $\phi$ to compensate. (At this point one could take $p = 2$ to get a quadratic expression, and regenerate the content addressable memory of Hopfield, 1984.) Then using (17d) we find an implementable neural net objective for large $p$:

\[ E_{\text{CAM}}(v, \sigma) = -\sum_m (v \cdot v^{m})\, \sigma_m - \epsilon\, v \cdot h_{\text{input}} + \frac{p - 1}{p} \sum_m |\sigma_m|^{p/(p-1)} + \sum_i \phi_{\pm 1}(v_i). \]

2.6.  Matrices,  Graphs,  and  Pointers 

One can apply order reduction to objectives containing polynomials of matrices. For dense $N \times N$ matrices, a typical term like $\mathrm{tr} \prod_{l=1}^{L} A^{(l)}$ contains $N^{L}$ scalar monomial interactions, but this can be reduced to $O(N^{3} L (\log L)^{2})$. (Here $\mathrm{tr}\,A \equiv$ trace of $A = \sum_i A_{ii}$.) To show this we need only establish the matrix analog of transformation (12), which upon iteration can reduce an $L$th order matrix monomial to $O(L (\log L)^{2})$ third-order matrix monomials like $ABC$. Each of these involves $N^{3}$ scalar monomial interactions.

Using eqn (12) (one could also use (13)), one can reduce $\mathrm{tr}\,XY$, where $X$ and $Y$ are now matrix-valued expressions. This form has exactly the same generality as $\mathrm{tr}\,XY^{T}$ (where $Y^{T}$ = transpose of $Y$).

\[ \begin{aligned}
\mathrm{tr}\,XY^{T} = \sum_{ij} X_{ij} Y_{ij}
&\to \sum_{ij} \Big( X_{ij}(\sigma_{ij} - \tau_{ij}) + Y_{ij}(\sigma_{ij} - \omega_{ij}) - \tfrac12 \sigma_{ij}^2 + \tfrac12 \tau_{ij}^2 + \tfrac12 \omega_{ij}^2 \Big) \\
&= \mathrm{tr}\,X(\sigma - \tau)^{T} + \mathrm{tr}\,Y(\sigma - \omega)^{T} - \tfrac12 \mathrm{tr}\,\sigma\sigma^{T} + \tfrac12 \mathrm{tr}\,\tau\tau^{T} + \tfrac12 \mathrm{tr}\,\omega\omega^{T}.
\end{aligned} \qquad (30) \]

This transformation preserves the sparseness of $X$ and $Y$ in the following sense: if $X_{ij} = Y_{ij} = 0$ at a fixed point, then $\sigma_{ij} = \tau_{ij} = \omega_{ij} = 0$ and the contribution of these neurons to the gradient of the objective is also zero.
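Transformation (30) and its sparseness property can be illustrated elementwise. The sketch below assumes the interneuron fixpoints $\sigma_{ij} = X_{ij} + Y_{ij}$, $\tau_{ij} = X_{ij}$, $\omega_{ij} = Y_{ij}$ that follow from the quadratic potentials; the random sparse test matrices and all names are illustrative.

```python
import random

def reduced_trace(X, Y, S, T, W):
    """Right-hand side of eqn (30), summed over matrix entries."""
    total = 0.0
    for i in range(len(X)):
        for j in range(len(X[0])):
            s, t, w = S[i][j], T[i][j], W[i][j]
            total += (X[i][j] * (s - t) + Y[i][j] * (s - w)
                      - 0.5 * s * s + 0.5 * t * t + 0.5 * w * w)
    return total

rng = random.Random(0)
n = 4
# Roughly half of the entries are zero, to exercise the sparseness claim.
X = [[rng.choice([0.0, rng.uniform(-1, 1)]) for _ in range(n)] for _ in range(n)]
Y = [[rng.choice([0.0, rng.uniform(-1, 1)]) for _ in range(n)] for _ in range(n)]

# Interneuron fixpoints (assumed): sigma = X + Y, tau = X, omega = Y.
S = [[X[i][j] + Y[i][j] for j in range(n)] for i in range(n)]
T = [row[:] for row in X]
W = [row[:] for row in Y]

direct = sum(X[i][j] * Y[i][j] for i in range(n) for j in range(n))
gap = abs(reduced_trace(X, Y, S, T, W) - direct)

# Wherever X_ij = Y_ij = 0, all three interneurons are zero as well.
sparse_ok = all(S[i][j] == T[i][j] == W[i][j] == 0.0
                for i in range(n) for j in range(n)
                if X[i][j] == 0.0 and Y[i][j] == 0.0)
```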

A major problem in neural network research (c.f.
Feldman, 1982) is to reduce the cost of networks
which manipulate graphs. Usually (Hopfield & Tank,
1985, 1986; Mjolsness, Gindi & Anandan, 1989a)
objectives for such problems involve dense matrices
of  neurons  representing  all  the  possible  links  in  a 
graph.  But  the  graphs  that  arise  in  computer  science 
and  in  computer  programming  usually  have  a  rela¬ 
tively  small  number  of  links  per  node,  and  are  there¬ 
fore  representable  by  sparse  matrices.  (If  a  sparse 
matrix's  entries  are  all  zero  or  one,  it  is  equivalent 
to  a  set  of  “pointers”  in  many  current  computer  lan¬ 
guages.  Pointers  are  used  ubiquitously,  wherever 
some  fluidity  in  data  representation  is  required.) 
Since  we  have  just  shown  how  to  reduce  a  wide  class 
of  matrix  objective  functions  to  summands  of  the 
form  tr  ABC  while  retaining  sparseness,  it  becomes 
important  to  reduce  this  form  further  by  exploiting 
the  sparseness  of  the  matrices  involved: 

\[ \mathrm{tr}\,ABC \;\to\; ? \]

where $A$, $B$, and $C$ are sparse-matrix-valued dynamical variables. We do not yet know how to do this correctly.

One approach to this problem is through the use of codes, like the binary code, which can concisely name the pair of nodes connected by each nonzero matrix element. Zero matrix elements are not explicitly encoded and this is the advantage of the method. A disadvantage of such codes, for objective functions, is that the configuration space is altered in such a way that new local minima may be introduced, though the old ones will be preserved. A code which allows order reduction to proceed most advantageously is used in the sorting networks of section 3.4.

Another approach is used by Fox and Furmanski (1988). Their load-balancing network involves binary encoding, but the network evolution is divided into a number of phases in which different classes of neurons are allowed to vary while most neurons are held constant. The connections are different from one phase to the next, and do not recur, so that the network is not directly implemented in terms of a circuit but rather requires "virtual" neurons and connections. A virtual neural network can be provided by suitable software on general-purpose computers, or perhaps by further objective function transformations leading to a real (and efficient) neural circuit; the latter alternative has not been achieved.

2.7.  Control  of  Neural  Dynamics 

So far we have exclusively considered steepest-ascent-descent dynamics such as eqns (3), (4), or (18), which allow little control over the temporal behavior of a network. Often one must design a network with nontrivial temporal behaviors such as running longer in exchange for less circuitry, or focusing attention on one part of a problem at a time. We discuss two algebraic transformations which can be used to introduce detailed control of the dynamics with which an objective is extremized.

One transformation, developed by one of the authors in collaboration with W. Miranker (Mjolsness & Miranker, 1990), replaces an objective $E$ with an associated Lagrangian functional to be extremized in a novel way:

\[ E[v] \;\to\; L[\dot v \,|\, q] = \int dt\, \Big( K[\dot v, v \,|\, q] + \frac{dE}{dt} \Big), \qquad \delta L / \delta \dot v_i(t) = 0. \qquad (31) \]

Here $q$ is a set of control parameters, and $K$ is a cost-of-movement term independent of the problem and of $E$. The integrand $\mathcal{L} = K + dE/dt$ is called the Lagrangian density. Ordinarily in physics, Lagrangian dynamics have a conserved total energy which prohibits convergence to fixed points. Here the main difference is the unusual functional derivative with respect to $\dot v$ rather than $v$. This is a "greedy" functional derivative, in which the trajectory is optimized from beginning to end by repeatedly choosing an extremal value of $\dot v(t)$ without considering its effect on any subsequent portion of

the path:

\[ \frac{\delta}{\delta \dot v_i(t)} \int dt'\, \mathcal{L}[\dot v, v] = \delta(0)\, \frac{\partial \mathcal{L}[\dot v, v]}{\partial \dot v_i(t)} \equiv \delta(0)\, \frac{\delta L}{\delta \dot v_i(t)}. \qquad (32) \]


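Since

\[ \frac{\delta L}{\delta \dot v_i} = \frac{\partial K}{\partial \dot v_i} + \frac{\partial E}{\partial v_i}, \qquad (33) \]

transformation (31) preserves fixpoints if $\partial K / \partial \dot v_i = 0 \Leftrightarrow \dot v = 0$.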

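For example, with a suitable $K$ one may recover and improve upon steepest-ascent-descent dynamics:

\[ E[v] \;\to\; L[\dot v \,|\, r, s] = \int dt \sum_i \Big[ s_i\, \phi_{\pm 1}(\dot v_i / r) + (\partial E / \partial v_i)\, \dot v_i \Big], \qquad (34) \]

\[ 0 = \delta L / \delta \dot v_i(t) = s_i\, \phi'_{\pm 1}(\dot v_i / r) / r + \partial E / \partial v_i, \quad \text{i.e.} \quad \dot v_i = r\, g_{\pm 1}(-s_i r\, \partial E / \partial v_i), \]

where the transfer function $-1 \le g_{\pm 1}(x) \le 1$ reflects a velocity constraint $-r \le \dot v_i \le r$, and as usual $g = (\phi')^{-1}$. The constants $s_i = 1/s_i = \pm 1$ are used to determine whether a neuron attempts to minimize or maximize $E$ and $L$. If all $s_i = 1$ then $dE/dt \le 0$ and eqn (34) is a descent dynamics.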
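The velocity-limited dynamics $\dot v_i = r\, g_{\pm 1}(-s_i r\, \partial E / \partial v_i)$ can be sketched on a toy objective. Everything specific below is an assumption made for illustration: $g_{\pm 1} = \tanh$ stands in for the bounded odd transfer function, the quadratic objective is arbitrary, and forward Euler with a small step replaces continuous time.

```python
import math

def objective(v):
    """Toy quadratic objective (illustrative only)."""
    return 0.5 * (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2

def gradient(v):
    return [v[0] - 1.0, 4.0 * (v[1] + 0.5)]

def velocity(v, r=1.0, s=(1.0, 1.0)):
    """v_dot_i = r * g(-s_i * r * dE/dv_i), with g = tanh as the assumed g_{+-1}."""
    return [r * math.tanh(-si * r * gi) for si, gi in zip(s, gradient(v))]

v, dt = [3.0, 2.0], 0.01
energies, max_speed = [objective(v)], 0.0
for _ in range(2000):
    v_dot = velocity(v)
    max_speed = max(max_speed, max(abs(c) for c in v_dot))
    v = [vi + dt * di for vi, di in zip(v, v_dot)]
    energies.append(objective(v))

monotone = all(later <= earlier + 1e-12
               for earlier, later in zip(energies, energies[1:]))
```

With all $s_i = 1$ the objective never increases and no component of $\dot v$ exceeds $r$, matching the descent property and the velocity constraint stated for eqn (34).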

Another transformation (proposed and subjected to preliminary experiments in Mjolsness, 1987) can be used to construct a new objective for the control parameters, $q$, through their effect on the trajectory $v(t)$:

\[ E[v] \;\to\; \hat E[q] = \sum_i s_i\, \frac{\partial E}{\partial v_i}\, \dot v_i[q] + E_{\text{cost}}[q] = \frac{dE}{dt} + E_{\text{cost}}[q], \quad \text{if all } s_i = 1. \qquad (35) \]

In (Mjolsness, 1987) the $s_i = 1$ version of transformation (35) (but not (34)) was used to introduce a computational "attention mechanism" for neural nets as follows. Suppose we can only afford to simulate $R$ out of $N \gg R$ neurons at a time in a large net. The $R$ neurons can be chosen dynamically via control parameters $q_i = r_i \in [0, 1]$ in eqn (34), with $r_i = 0$ for all but $R$ active neurons. For high-gain $g_{\pm 1}$ we have $g_{\pm 1}(x) = \mathrm{sgn}(x)$ and

\[ \frac{dE}{dt} = \sum_i \frac{\partial E}{\partial v_i}\, \dot v_i = -\sum_i r_i \left| \frac{\partial E}{\partial v_i} \right|. \]

(Whether the gain is high or not, $g_{\pm 1}$ is an odd function so $dE/dt \le 0$ and any dynamics for $r$ yields a descent algorithm for $E$.) The objective for $r$ is

*i  - -Hfj  *  f 2 

which describes an $R$-winner version of a WTA network that determines which $R$ neurons should be active and which $N - R$ should be frozen.

An especially simple and cheap dynamical system for $r$ is to keep all $r_i$ constant most of the time, but every so often to interrupt the simulation of $\delta L / \delta \dot v = 0$ and completely relax $\hat E[r]$. This amounts to re-sorting the neurons $v_i$ according to their gradient magnitudes $|\partial E / \partial v_i|$, and selecting only the first $R$ neurons to be active ($r_i = 1$). When simulating a sparsely connected neural net on other underlying hardware, this algorithm can be implemented very cheaply since most $v$ gradients are unchanged between phases of $r$ relaxation, and therefore the new sorted order is just a minor refinement of the previous one. The result is necessarily a descent algorithm for $E[v]$, with only $R$ neurons active at any time.
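The re-sorting scheme just described can be sketched as follows. This is a hedged illustration rather than the authors' simulator: the separable quadratic objective, the plain gradient step used for the active neurons, and the parameters `dt`, `sweeps`, and `resort_every` are all assumptions.

```python
def gradient(v, a, c):
    return [ai * (vi - ci) for vi, ai, ci in zip(v, a, c)]

def objective(v, a, c):
    return 0.5 * sum(ai * (vi - ci) ** 2 for vi, ai, ci in zip(v, a, c))

def choose_active(grad, R):
    """Indices of the R neurons with the largest gradient magnitude (r_i = 1)."""
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return ranked[:R]

def relax(v, a, c, R, dt=0.05, sweeps=1000, resort_every=10):
    """Descent in which only R neurons move; the active set is re-sorted periodically."""
    energies, active = [objective(v, a, c)], []
    for sweep in range(sweeps):
        if sweep % resort_every == 0:          # occasional relaxation of r
            active = choose_active(gradient(v, a, c), R)
        grad = gradient(v, a, c)
        for i in active:                       # frozen neurons (r_i = 0) do not move
            v[i] -= dt * grad[i]
        energies.append(objective(v, a, c))
    return v, energies

a = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5]
c = [1.0, -2.0, 0.5, 1.5, -1.0, 2.0]
v_final, energies = relax([0.0] * 6, a, c, R=2)
monotone = all(later <= earlier + 1e-12
               for earlier, later in zip(energies, energies[1:]))
```

Only two of the six neurons are updated in any sweep, yet the objective decreases monotonically and all neurons eventually reach their minima, because the neurons with the largest remaining gradients are the ones promoted at each re-sort.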


2.8.  List  of  Transformations 

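Let $X$ and $Y$ be any algebraic expressions, containing any number of variables. We list the following fixpoint-preserving transformations of objectives, or summands thereof:

\[ \min_\alpha X_\alpha \;\to\; \sum_\alpha X_\alpha \sigma_\alpha + C \Big( \sum_\alpha \sigma_\alpha - 1 \Big) \lambda - \frac{C}{2} \lambda^2 + \sum_\alpha \phi(\sigma_\alpha) \]

(Large $C$, and high-gain hysteresis-free barrier function $\phi$ which confines $\sigma_\alpha$ to (0, 1).)

\[ \max_\alpha X_\alpha \;\to\; \sum_\alpha X_\alpha \sigma_\alpha - C \Big( \sum_\alpha \sigma_\alpha - 1 \Big) \lambda + \frac{C}{2} \lambda^2 - \sum_\alpha \phi(\sigma_\alpha) \]

(Same conditions.)

\[ \prod_i X_i \;\to\; \sum_i \big( \sigma\tau_i - X_i\omega_i + \omega_i e^{\tau_i} \big) + \sigma(1 - \log \sigma). \]

\[ \mathrm{tr}\,XY^{T} \;\to\; \mathrm{tr}\,X(\sigma - \tau)^{T} + \mathrm{tr}\,Y(\sigma - \omega)^{T} - \tfrac12 \mathrm{tr}\,\sigma\sigma^{T} + \tfrac12 \mathrm{tr}\,\tau\tau^{T} + \tfrac12 \mathrm{tr}\,\omega\omega^{T} \]

(All matrices, possibly sparse.)

\[ E[v] \;\to\; L[\dot v \,|\, q] = \int dt\, \Big( K[\dot v \,|\, v, q] + \frac{dE}{dt} \Big), \qquad \delta L / \delta \dot v(t) = 0 \]

($\partial K / \partial \dot v = 0 \Leftrightarrow \dot v = 0$)

\[ E[v] \;\to\; \hat E[q] = \sum_i s_i\, \frac{\partial E}{\partial v_i}\, \dot v_i[q] + E_{\text{cost}}[q] \quad (s_i = \pm 1) \]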


The variables $\sigma$, $\tau$, $\omega$, and $\lambda$ are assumed not to occur elsewhere in the original objective. Note that each transformation may have restrictions on its applicability, in addition to the particular form it matches.

We will report experiments only with transformations 1.1, 1.2, 2.2, and 5.1 on this list. Experiments with transformations 6.1 and 6.2 will be reported in a later paper (Mjolsness & Miranker, 1990). The rest are still theoretical.

These transformations may be iterated, at the expense of creating interactions between the added variables. They can be used to reduce the nonlinearity of the interactions in a neural network, transferring such nonlinearity to single-neuron potentials or distributing it among several simpler interactions.

3.  DESIGN  EXAMPLES 

3.1. Convolutions and Coordinate Transformations

Discrete convolutions

\[ O_i = \sum_j K_{i-j} I_j \]

(where index subtraction is defined appropriately) and linear coordinate transformations

\[ x'_i = \sum_j A_{ij} x_j + b_i \]

can both be expressed as sums of squared penalty terms in an objective:

\[ E_{\text{conv}} = \frac{c}{2} \sum_i \Big( O_i - \sum_j K_{i-j} I_j \Big)^2 \qquad (37) \]

or

\[ E_{\text{coord}} = \frac{c}{2} \sum_i \Big( x'_i - \sum_j A_{ij} x_j - b_i \Big)^2, \qquad (38) \]

in which $c > 0$. (Equation (38) subsumes eqn (37).) Alternatively the convolution or coordinate change could be turned into a hard constraint by using Lagrange multiplier neurons, but those procedures still work best when a penalty term exactly like eqn (37) or eqn (38) is added to the objective (Platt & Barr, 1988; Luenberger, 1984). As they stand, these objectives expand into very expensive networks due to the spurious squaring of the matrix. That is because convolution kernels, which are usually constant but sparse, have their fanout squared; and coordinate transformations, which are usually dense but variable, have an excessive number of new (high-order) interactions created when $A$ is squared.

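Of course, eqn (38) is of the type which we know how to transform using reversed linear neurons. We obtain the modified objective

\[ \hat E_{\text{coord}} = c \sum_i \Big[ \Big( x'_i - \sum_j A_{ij} x_j - b_i \Big) \sigma_i - \tfrac12 \sigma_i^2 \Big], \qquad (39) \]

which doesn't square $A$. If $A$ is constant then there is no order reduction, since both $E_{\text{coord}}$ and $\hat E_{\text{coord}}$ are second order, but there are fewer connections unless $A$ is also dense.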

An alternative objective, not using reversed neurons, is also available for convolutions and coordinate system transformations. The objective

\[ \tilde E_{\text{coord}} = \frac{c}{2} \sum_{ij} A_{ij} (x'_i - b_i - x_j)^2 \qquad (40) \]

is minimal with respect to $x'$ when

\[ x'_i = \sum_j A_{ij} x_j \Big/ \sum_j A_{ij} + b_i. \qquad (41) \]

This type of dynamic normalization may be desirable, or if $A$ is constant and already normalized then it does not hurt. Equation (40) also preserves any sparseness of $A$, and does not square the matrix.
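The normalized minimizer of eqn (41) follows from setting the $x'_i$-derivative of eqn (40) to zero, and this can be confirmed numerically for a single row of $A$. A minimal sketch, assuming positive weights $A_{ij}$ and the reading $(c/2) \sum_{ij} A_{ij} (x'_i - b_i - x_j)^2$ of eqn (40); the names and values are illustrative.

```python
def row_objective(A_row, x, b_i, x_prime_i, c=1.0):
    """Contribution of one output coordinate x'_i to the assumed eqn (40)."""
    return 0.5 * c * sum(a * (x_prime_i - b_i - xj) ** 2
                         for a, xj in zip(A_row, x))

def normalized_minimizer(A_row, x, b_i):
    """Eqn (41): weighted mean of the inputs plus the offset b_i."""
    return sum(a * xj for a, xj in zip(A_row, x)) / sum(A_row) + b_i

A_row = [0.5, 0.0, 2.0, 1.5]      # a sparse row: one weight is zero
x = [1.0, -3.0, 0.25, 2.0]
b_i = 0.4

x_star = normalized_minimizer(A_row, x, b_i)
at_min = row_objective(A_row, x, b_i, x_star)
h = 1e-3
above = row_objective(A_row, x, b_i, x_star + h)
below = row_objective(A_row, x, b_i, x_star - h)
```

The input attached to the zero weight never enters either function, which is the sense in which the sparseness of $A$ is preserved.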


3.2.  Random  Dot  Matching 

Here the problem begins with two inputs: a planar pattern of $n$ dots specified by their independent random positions $x_i$ within a unit square, and a pattern of $m \ll n$ dots specified by their positions $y_a$. The $y_a$ are generated by randomly selecting $m$ of the $x$ dots, independently perturbing their positions by random displacements of 1/10 the size of the square, and globally shifting all $m$ dots by a translation vector $\Delta$. The problem is to reconstruct $\Delta$ from $\{x_i\}$ and $\{y_a\}$, by consistently matching the dots. Since the match is parameterized by a few geometric parameters, this is really an image "registration" problem.

A simple objective for this task is

\[ E_{\text{dots}}[\Delta] = -\sum_i \sum_a \exp\big( -|x_i - y_a - \Delta|^2 / 2K^2 \big), \qquad (42) \]

where the search width $K$ is to be set initially to the width of the square (1.0) and gradually decreased down to the size of the geometric noise (0.1) as the optimization proceeds, in order to find better local minima. This objective, and the gradual change in $K$, is quite similar to that of the elastic net approach to the traveling salesman problem (Durbin & Willshaw, 1987). Now $E_{\text{dots}}$ may be transformed to remove the exponential from the interactions:

\[ \hat E_{\text{dots}}[\Delta, \tau] = \frac{1}{2K^2} \sum_{ia} |x_i - y_a - \Delta|^2\, \tau_{ia} + \sum_{ia} \tau_{ia} (\log \tau_{ia} - 1) \qquad (43) \]

with dynamics

\[ \dot\Delta = (1/K^2) \sum_{ia} \tau_{ia} (x_i - y_a - \Delta), \qquad \tau_{ia} = \exp \omega_{ia}, \qquad \dot\omega_{ia} = r_\omega \big( -\omega_{ia} - (1/2K^2)\, |x_i - y_a - \Delta|^2 \big). \qquad (44) \]

We experiment with $n = 30$ and $m = 6$. In Figure 3 we have shown a contour map of $E_{\text{dots}}$, which is to be minimized, along with the projection of a $K = .2$ trajectory onto the $\Delta$ plane. The initial condition came from partially relaxing the net at $K = .5$ first, where the objective is unimodal. The $K = .2$ problem is somewhat more difficult than incrementally updating a solution in response to a small change in $K$, but the network found the right answer. For $n = 30$ and $m = 6$, some random patterns of dots would require several large-$K$ minima to be tracked to small $K$ for correct operation, but this defect of the original objective (42) did not arise for the case shown here (or 13 out of 15 other cases we examined) and is not relevant to the validity of the transformation.

The main numerical drawback of the 180 exponential-taking neurons in this net is that small time steps may be required. In using the Runge-Kutta method for solving eqn (44) the stepsize had to be $\Delta t = .0003$ near the starting point, even though eventually it could be increased to .003.
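The relation between objectives (42) and (43) can be checked by evaluating (43) at the interneuron fixpoint $\tau_{ia} = \exp(-|x_i - y_a - \Delta|^2 / 2K^2)$. The sketch below assumes that (43) carries the factor $1/2K^2$ on its first sum (consistent with the $1/K^2$ in the dynamics); the dot patterns are generated arbitrarily for illustration and are not the patterns used in the reported experiment.

```python
import math
import random

def sq_dist(x, y, delta):
    return (x[0] - y[0] - delta[0]) ** 2 + (x[1] - y[1] - delta[1]) ** 2

def e_dots(xs, ys, delta, K):
    """Objective (42)."""
    return -sum(math.exp(-sq_dist(x, y, delta) / (2 * K * K))
                for x in xs for y in ys)

def e_dots_transformed(xs, ys, delta, K, tau):
    """Assumed form of objective (43), with one interneuron tau[i][a] per dot pair."""
    total = 0.0
    for i, x in enumerate(xs):
        for a, y in enumerate(ys):
            t = tau[i][a]
            total += sq_dist(x, y, delta) / (2 * K * K) * t + t * (math.log(t) - 1.0)
    return total

rng = random.Random(1)
xs = [(rng.random(), rng.random()) for _ in range(30)]     # n = 30 dots
ys = [(x[0] + 0.05, x[1] - 0.02) for x in xs[:6]]          # m = 6 shifted dots
delta, K = (0.1, -0.1), 0.2

tau_star = [[math.exp(-sq_dist(x, y, delta) / (2 * K * K)) for y in ys] for x in xs]
gap = abs(e_dots(xs, ys, delta, K) - e_dots_transformed(xs, ys, delta, K, tau_star))
n_interneurons = sum(len(row) for row in tau_star)
```

The pair count reproduces the 180 exponential neurons quoted for $n = 30$ and $m = 6$.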


3.3  Graph  Matching  and  Quadratic  Match 

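Consider the following objective for inexact graph-matching (Hopfield & Tank, 1986; c.f. von der Malsburg & Bienenstock, 1986):

\[ E_{\text{graph}} = -c_1 \sum_{\alpha\beta ij} G_{\alpha\beta}\, g_{ij}\, M_{\alpha i} M_{\beta j} + c_2 \sum_\alpha \Big( \sum_i M_{\alpha i} - 1 \Big)^2 + c_2 \sum_i \Big( \sum_\alpha M_{\alpha i} - 1 \Big)^2 + c_3 \sum_{\alpha i} M_{\alpha i} (1 - M_{\alpha i}) \qquad (45) \]

where $G$ and $g$ are connection matrices (each entry is zero or one) for two graphs, and $M_{\alpha i}$ is one when node $\alpha$ of $G$ maps to node $i$ of $g$, and zero otherwise.

The problem may be generalized slightly to quadratic matching by replacing the $GgMM$ term with

\[ \sum_{\alpha\beta ij} G_{\alpha\beta}\, g_{ij}\, A_{\alpha i} B_{\beta j} \qquad (46) \]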

FIGURE 3. Random dot matching network. Trajectory of net evolution equations (44) projected to the $\Delta$ plane, superposed on a contour plot of $E_{\text{dots}}$. $K$ was .2. The starting point was obtained by a partial relaxation of the same net for $K = .5$, for which the objective has a single local minimum, starting from $\Delta = (1, 1)$ and $\omega = 0$. (Further relaxation at $K = .5$ would result in an initial $\Delta$ even closer to the $K = .2$ minimum.) The correct answer is $\Delta = 0$.


and altering the other constraints to reflect the fact that $A \ne B$. What we have to say about graph matching will apply equally well to this generalization.

The $GgMM$ term is superficially the expensive one since it involves four sums. If each graph is constant, there are $O(N)$ nodes in each graph, and both graphs have fanout $f$, then the number of monomial synapses is $O(N^2 f^2)$. We can reduce this to $O(N^2 f)$. Also if one of $G$ or $g$ is variable, with $N$ nodes in the variable graph and $m$ in the constant graph, as in the "Frameville" networks of (Mjolsness et al., 1989a), and both graphs are represented densely, then the number of synapses is reduced from $O(N^2 m f)$ to $O(N^2 m + N m f)$. The reduction uses linear interneurons, both reversed and normal:


E,  =  -c,  7 


-  3' 


$E_1$ and $\hat E_1$ are illustrated in Figure 4.
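One way to see how the four-index $GgMM$ sum can become a sum over three indices is to factor it as $\sum_{\alpha j} (\sum_\beta G_{\alpha\beta} M_{\beta j}) (\sum_i M_{\alpha i} g_{ij})$ and apply the product transformation (12) to each $(\alpha, j)$ pair. The sketch below checks that factorization at the interneuron fixpoints. It is an assumption consistent with the "rectangle into triangles" description and the quoted $O(N^2 f)$ count, not a transcription of the reduced objective, whose exact form is not legible in the source; all names and test matrices are illustrative.

```python
import random

def four_index_sum(G, g, M):
    """Direct GgMM term: sum over alpha, beta, i, j."""
    N, m = len(G), len(g)
    return sum(G[a][b] * g[i][j] * M[a][i] * M[b][j]
               for a in range(N) for b in range(N)
               for i in range(m) for j in range(m))

def reduced_sum(G, g, M):
    """Three-index form with one (sigma, tau, omega) triple per (alpha, j) pair."""
    N, m = len(G), len(g)
    total = 0.0
    for a in range(N):
        for j in range(m):
            X = sum(G[a][b] * M[b][j] for b in range(N))   # triangle (alpha, beta, j)
            Y = sum(M[a][i] * g[i][j] for i in range(m))   # triangle (alpha, i, j)
            sigma, tau, omega = X + Y, X, Y                # interneuron fixpoints
            total += (X * (sigma - tau) + Y * (sigma - omega)
                      - 0.5 * sigma ** 2 + 0.5 * tau ** 2 + 0.5 * omega ** 2)
    return total

rng = random.Random(2)
N, m = 5, 4
G = [[float(rng.random() < 0.4) for _ in range(N)] for _ in range(N)]
g = [[float(rng.random() < 0.4) for _ in range(m)] for _ in range(m)]
M = [[rng.random() for _ in range(m)] for _ in range(N)]

gap = abs(four_index_sum(G, g, M) - reduced_sum(G, g, M))
```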

The reduced graph-matching network works in simulation, for five out of six small ($N = 10$ nodes) hand-designed graphs with low fanout (from 1.8 to 2.5). The sixth case is not solved by the original, untransformed network either. The parameters we used were

N = 10    c_1 = 1.0    c_2 = 1.0    c_3 = 0.5
g_0(M) = 50    g_0(σ) = 1.0    g_0(WTA) = 10.0    r_σ = 1
Δt = .004    sweeps = 1000

and for the original network:

N = 10    c_1 = 1.0    c_2 = 1.0    c_3 = 0.5
g_0(M) = 20    g_0(σ) = 1.0    g_0(WTA) = 10.0    Δt = .004
sweeps = 1000

FIGURE 4. Graph matching networks. $E_1$ is a sum over the indices $\alpha$, $\beta$, $i$, and $j$, which are connected by neurons (line segments) in the shape of a "rectangle". This objective can be transformed into $\hat E_1$, which is a sum of triangles, while preserving fixpoints. The triangles are obtained from the rectangle by introducing linear interneurons along a diagonal, as shown. Only three indices are summed over, resulting in a less costly network.

Here $g_0(M)$ is the gain $g'(0)$ of the transfer function $g(M)$ for $M$, and similarly $g_0(\sigma)$ is the gain for the linear neurons that were introduced through transforming $E_1$, namely $\sigma$, $\tau$, and $\omega$. Also $g_0(\mathrm{WTA})$ is the gain for the infinitely fast linear neurons which were used in both reduced and control experiments to implement the WTA constraints; this parameter effectively multiplies $c_2$. The parameter sweeps is the number of iterations of the forward Euler method used in simulating the continuous update equations. Each iteration advanced the time coordinate by $\Delta t$.

There is only a little parameter-tuning involved here, concentrated on $r_\sigma$, $\Delta t$, and sweeps. The product $\Delta t \times \max(r_\sigma, r_M = 1)$ should be held fixed to maintain constant resolution in the discrete simulation of continuous update equations. But holding $\Delta t$ and the other parameters fixed at the quoted values, the rate parameter $r_\sigma$ can be varied from unity to 100 without altering the network convergence time, measured in sweeps, by more than 30% or so; network performance remains the same in that the same 5 out of 6 graphs are correctly matched. This would suggest, and other experiments confirm, that for low $r$ there is some room for increasing $\Delta t$ and decreasing sweeps ($r = 10$, $\Delta t = .016$, and sweeps $= 300$, respectively); this saves time whether time is measured in simulation sweeps or in circuit time constants.
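The role of the product of time step and rate can be illustrated with a minimal sketch of forward Euler integration. The toy quadratic objective, the function name `euler_descent`, and the specific rates below are assumptions made for illustration; they are not the graph-matching update equations of the paper.

```python
import numpy as np

def euler_descent(grad, v0, rate, dt, sweeps):
    """Forward Euler integration of dv/dt = -rate * grad(v) for `sweeps` steps."""
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(sweeps):
        # each sweep advances the time coordinate by dt
        v = v - dt * rate * grad(v)
    return v

# Toy objective E(v) = 0.5 * |v|^2, so grad E(v) = v (hypothetical stand-in).
grad = lambda v: v

# Two runs with the same product dt * rate follow the same discrete trajectory.
a = euler_descent(grad, [1.0, -2.0], rate=1.0, dt=0.004, sweeps=1000)
b = euler_descent(grad, [1.0, -2.0], rate=4.0, dt=0.001, sweeps=1000)
```

In this sketch the per-sweep change depends only on `dt * rate`, which is why that product, rather than either factor alone, sets the resolution of the discrete simulation.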

3.4. Sorting

Sorting may be described in a manner very similar to graph matching. One requires a permutation matrix $M$ which sorts the inputs into increasing order, so all terms in the objective remain the same except for $GgMM$. The objective becomes

$$
\begin{aligned}
E_{\mathrm{sort}} = {} & -c_1 \sum_{ij} M_{ij} x_i y_j + c_2 \sum_i \Big( \sum_j M_{ij} - 1 \Big)^2 \\
& + c_2 \sum_j \Big( \sum_i M_{ij} - 1 \Big)^2 - c_3 \sum_{ij} M_{ij} (1 - M_{ij}) \\
& + \frac{1}{g_0} \sum_{ij} \int^{M_{ij}} dx\, g^{-1}(x).
\end{aligned} \tag{48}
$$

Here $x_i$ are a set of input numbers to be sorted, and $y_j$ are a constant set of numbers in increasing order (e.g., $y_j = j$). $M$ will become that permutation matrix which maximizes the inner product of $x$ and $y$ (i.e., maps the largest $x_i$ to the largest $y_j$, the next largest $x_i$ to the next largest $y_j$, and so on).
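As a small check of the claim that the sorting permutation maximizes the permuted inner product, the following sketch builds $M$ directly by rank matching and compares it with a brute-force search over all permutations. The function names and the sample data are hypothetical; this is a discrete illustration, not the analog network.

```python
import itertools
import numpy as np

def sorting_permutation(x, y):
    """M[i, j] = 1 when x[i] and y[j] have the same rank (rank matching)."""
    x, y = np.asarray(x), np.asarray(y)
    M = np.zeros((len(x), len(y)), dtype=int)
    M[np.argsort(x), np.argsort(y)] = 1
    return M

def brute_force_best(x, y):
    """Permutation matrix maximizing sum_ij M_ij x_i y_j, by exhaustive search."""
    n = len(x)
    best = max(itertools.permutations(range(n)),
               key=lambda perm: sum(x[i] * y[perm[i]] for i in range(n)))
    M = np.zeros((n, n), dtype=int)
    M[np.arange(n), list(best)] = 1
    return M

x = [0.3, -1.2, 2.5, 0.9]   # inputs to be sorted (sample values)
y = [1, 2, 3, 4]            # constant increasing targets, y_j = j
M = sorting_permutation(x, y)
sorted_x = M.T @ np.asarray(x)   # x read out in increasing order
```

For distinct inputs the maximizer is unique, so the rank-matching matrix and the brute-force maximizer coincide.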




After using the winner-take-all reduction on the row and column constraints of $M$, this network has $O(N^2)$ connections and neurons. One cannot do better without reducing the number of match neurons $M$. But $M$ is sparse at any acceptable answer, so it may be possible to keep it sparse throughout the time the network is running by using a different encoding of the matrix. For example, one might encode indices $i$, $j$, or both in a binary representation as would be done in an ordinary computer program for sorting, in which one commonly uses the binary representation of pointers to represent a sparse graph $M_{ij} \in \{0, 1\}$; alternatively, if $M$ is always a permutation one can represent it by a one-dimensional array of binary addresses $j[i]$. The resulting objectives generally still have $O(N^2)$ connections (monomial interactions), but a well-chosen matrix encoding, supplemented by suitable reversed neurons, can drastically reduce the number of connections.

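In Appendix B it is shown that any permutation matrix $M$ of size $L^2 \times L^2$ can be represented in the following form:

$$M_{i_1 i_2, j_1 j_2} = \sum_{k_1 k_2} A^{(1)}_{i_1 i_2, k_1} A^{(2)}_{i_2, k_1 k_2} \tilde{A}^{(1)}_{j_1 j_2, k_1} \tilde{A}^{(2)}_{j_2, k_1 k_2} \tag{49}$$

where $i \in \{1, \ldots, N = L^2\}$, $i_1, i_2 \in \{1, \ldots, \sqrt{N} = L\}$, and $i = i_1 L + i_2 \equiv (i_1, i_2)$, and where two constraints apply to each nonsquare matrix:

$$
\begin{aligned}
\sum_{i_1} A^{(1)}_{i_1 i_2, k_1} &= 1, & \sum_{k_1} A^{(1)}_{i_1 i_2, k_1} &= 1, \\
\sum_{i_2} A^{(2)}_{i_2, k_1 k_2} &= 1, & \sum_{k_2} A^{(2)}_{i_2, k_1 k_2} &= 1, \\
\sum_{j_1} \tilde{A}^{(1)}_{j_1 j_2, k_1} &= 1, & \sum_{k_1} \tilde{A}^{(1)}_{j_1 j_2, k_1} &= 1, \\
\sum_{j_2} \tilde{A}^{(2)}_{j_2, k_1 k_2} &= 1, & \sum_{k_2} \tilde{A}^{(2)}_{j_2, k_1 k_2} &= 1.
\end{aligned} \tag{50}
$$

The matrix form (49) contains only $4N^{3/2}$ variables, and is our proposed encoding of $M$.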
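To make the variable counts concrete, the sketch below compares the $N^2$ entries of a dense match matrix with the $4N^{3/2}$ entries of the four factor matrices in eqn (49). The function names are hypothetical, and only matrix entries are counted; any auxiliary interneurons are left out.

```python
import math

def dense_match_variables(N):
    """Number of entries in a dense N x N match matrix M."""
    return N * N

def coarse_butterfly_variables(N):
    """Entries in A(1), A(2) and their tilde counterparts: 4 * N^(3/2).

    Requires N to be a perfect square, N = L * L.
    """
    L = math.isqrt(N)
    if L * L != N:
        raise ValueError("N must be a perfect square")
    return 4 * L ** 3

for N in (9, 16, 25, 36):
    print(N, dense_match_variables(N), coarse_butterfly_variables(N))
```

Under this count the two encodings tie at $N = 16$ and the factored form is smaller from $N = 25$ onward.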

As explained in Appendix C, eqn (49) is a coarse version of another expression for $M$ which contains $O(N \log N)$ variables. That expression codes index pairs $(i, j)$ using the "Butterfly" connection topology that arises in the fast Fourier transform (FFT) and in many other parallel algorithms. The advantage of the Butterfly is that it allows one to make a gradual transition from one space (e.g., index $i$) to another (index $j$). There has been little success in transforming objectives based on the much less gradual binary or base-$b$ encoding of $i$ and $M_{ij}$.² An example similar to eqn (49) is the base $\sqrt{N}$ code obtained by listing all $N$ links in the permutation matrix, indexed by $k$, and encoding their starting and ending locations as $i = (i_1, i_2)$ and $j = (j_1, j_2)$. Then

$$M_{i_1 i_2, j_1 j_2} = \sum_k A^{(1)}_{i_1, k} A^{(2)}_{i_2, k} \tilde{A}^{(1)}_{j_1, k} \tilde{A}^{(2)}_{j_2, k} \tag{51}$$

subject to obvious constraints on the $A$'s.

Since any permutation matrix can be expressed using eqn (49), any (approximately quadratic) local minimum of $E_{\mathrm{sort}}(M)$ (eqn (48)), for which $M$ is sufficiently close to being a permutation matrix, should be a local minimum of $E_{\mathrm{sort}}(A^{(1)}, A^{(2)}, \tilde{A}^{(1)}, \tilde{A}^{(2)})$; but there may be local minima with respect to $A$ and $\tilde{A}$ which would be unstable in the larger $M$ space. Thus, making the substitution (49) into eqn (48) may expand the set of fixed points, or alter it entirely if the original objective is not yet tuned to produce permutation matrices. By contrast, the objective function transformations used heretofore have exactly preserved the set of fixed points. We will try it out anyway, in order to get a low-cost net, and we will observe whether and how well it sorts.

The problem now is to reduce the number of monomial interactions to $O(N^{3/2})$. This is easy for quadratic penalty terms corresponding to the constraints (50), which consist of $8N$ winner-take-all constraints each involving $N^{1/2}$ variables. The remaining interaction $-c_1 \sum M x y$ can be reduced in two stages: substitute $M = A \tilde{A}^T$ with no reduction in connection costs; then substitute the $O(N^{3/2})$ forms for $A$ and $\tilde{A}$.

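Thus we may replace $E_1 = -c_1 \sum_{ij} M_{ij} x_i y_j$ with

$$
\begin{aligned}
E_1(A, \tilde{A}) = {} & -c_1 \sum_k \Big[ \tfrac12 \Big( \sum_i A_{ik} x_i + \sum_j \tilde{A}_{jk} y_j \Big)^2 - \tfrac12 \Big( \sum_i A_{ik} x_i \Big)^2 - \tfrac12 \Big( \sum_j \tilde{A}_{jk} y_j \Big)^2 \Big] \\
E(A, \tilde{A}, a, b, \tilde{b}) = {} & -c_1 \Big[ \sum_k (a_k - b_k) \sum_i A_{ik} x_i + \sum_k (a_k - \tilde{b}_k) \sum_j \tilde{A}_{jk} y_j \\
& \qquad + \tfrac12 \sum_k \big( -a_k^2 + b_k^2 + \tilde{b}_k^2 \big) \Big]
\end{aligned} \tag{52}
$$

which may be interpreted as four interacting sorting problems, with linear interneurons $a$ interpolating between $x$ and $y$ and with reversed neurons $b$ and $\tilde{b}$ cancelling echos as in eqn (15). So far there is no reduction in number of neurons or connections.

² A partial exception is the load-balancing network of Fox and Furmansky (1988), in which the crucial "histogram" may be understood as a set of reversed linear interneurons which simplify their load-balancing objective. But the result is a virtual neural net, not a statically connected circuit.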
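The product reduction behind eqn (52) can be checked numerically for a single index $k$. In the sketch below, `X` and `Y` stand for the two inner sums $\sum_i A_{ik} x_i$ and $\sum_j \tilde{A}_{jk} y_j$; the function names, the sample values, and the omission of the overall factor $-c_1$ are assumptions made for illustration.

```python
def reduced(X, Y, a, b, bt):
    """X*(a - b) + Y*(a - bt) + (1/2)(-a^2 + b^2 + bt^2), one k-term of eqn (52)."""
    return X * (a - b) + Y * (a - bt) + 0.5 * (-a * a + b * b + bt * bt)

def reduced_gradient(X, Y, a, b, bt):
    """Partial derivatives of `reduced` with respect to a, b and bt."""
    return (X + Y - a, -X + b, -Y + bt)

X, Y = 1.5, -0.25                 # sample values of the two inner sums
a, b, bt = X + Y, X, Y            # stationary values of the interneurons
value = reduced(X, Y, a, b, bt)   # should reproduce the product X * Y
grad = reduced_gradient(X, Y, a, b, bt)
```

At the stationary point the expression equals $XY$; it is a maximum in $a$ and a minimum in $b$ and $\tilde{b}$, which is consistent with $b$ and $\tilde{b}$ being reversed neurons.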




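If we substitute the special forms for $A$ and $\tilde{A}$, we find

$$
\begin{aligned}
E = {} & -c_1 \Big[ \sum_{i_1 i_2} \sum_{k_1 k_2} x_{i_1 i_2} A^{(1)}_{i_1 i_2, k_1} A^{(2)}_{i_2, k_1 k_2} (a_k - b_k) \\
& \qquad + \sum_{j_1 j_2} \sum_{k_1 k_2} y_{j_1 j_2} \tilde{A}^{(1)}_{j_1 j_2, k_1} \tilde{A}^{(2)}_{j_2, k_1 k_2} (a_k - \tilde{b}_k) \\
& \qquad - \tfrac12 \sum_k a_k^2 + \tfrac12 \sum_k b_k^2 + \tfrac12 \sum_k \tilde{b}_k^2 \Big] \\
= {} & -c_1 \Big[ \tfrac12 \sum_{i_2 k_1} \Big[ \Big( \sum_{i_1} A^{(1)}_{i_1 i_2, k_1} x_{i_1 i_2} + \sum_{k_2} A^{(2)}_{i_2, k_1 k_2} (a_k - b_k) \Big)^2 \\
& \qquad\qquad - \Big( \sum_{i_1} A^{(1)}_{i_1 i_2, k_1} x_{i_1 i_2} \Big)^2 - \Big( \sum_{k_2} A^{(2)}_{i_2, k_1 k_2} (a_k - b_k) \Big)^2 \Big] \\
& \qquad + \tfrac12 \sum_{j_2 k_1} \Big[ \Big( \sum_{j_1} \tilde{A}^{(1)}_{j_1 j_2, k_1} y_{j_1 j_2} + \sum_{k_2} \tilde{A}^{(2)}_{j_2, k_1 k_2} (a_k - \tilde{b}_k) \Big)^2 \\
& \qquad\qquad - \Big( \sum_{j_1} \tilde{A}^{(1)}_{j_1 j_2, k_1} y_{j_1 j_2} \Big)^2 - \Big( \sum_{k_2} \tilde{A}^{(2)}_{j_2, k_1 k_2} (a_k - \tilde{b}_k) \Big)^2 \Big] \\
& \qquad - \tfrac12 \sum_k a_k^2 + \tfrac12 \sum_k b_k^2 + \tfrac12 \sum_k \tilde{b}_k^2 \Big]
\end{aligned} \tag{53}
$$

(with $k \equiv (k_1, k_2)$) and finally

$$
\begin{aligned}
E = {} & -c_1 \Big[ \sum_{i_1 i_2 k_1} A^{(1)}_{i_1 i_2, k_1} x_{i_1 i_2} (\sigma_{i_2 k_1} - \tau_{i_2 k_1}) \\
& \qquad + \sum_{i_2 k_1 k_2} A^{(2)}_{i_2, k_1 k_2} (a_k - b_k) (\sigma_{i_2 k_1} - \omega_{i_2 k_1}) \\
& \qquad - \tfrac12 \sum_{i_2 k_1} \sigma_{i_2 k_1}^2 + \tfrac12 \sum_{i_2 k_1} \tau_{i_2 k_1}^2 + \tfrac12 \sum_{i_2 k_1} \omega_{i_2 k_1}^2 \\
& \qquad + \sum_{j_1 j_2 k_1} \tilde{A}^{(1)}_{j_1 j_2, k_1} y_{j_1 j_2} (\tilde{\sigma}_{j_2 k_1} - \tilde{\tau}_{j_2 k_1}) \\
& \qquad + \sum_{j_2 k_1 k_2} \tilde{A}^{(2)}_{j_2, k_1 k_2} (a_k - \tilde{b}_k) (\tilde{\sigma}_{j_2 k_1} - \tilde{\omega}_{j_2 k_1}) \\
& \qquad - \tfrac12 \sum_{j_2 k_1} \tilde{\sigma}_{j_2 k_1}^2 + \tfrac12 \sum_{j_2 k_1} \tilde{\tau}_{j_2 k_1}^2 + \tfrac12 \sum_{j_2 k_1} \tilde{\omega}_{j_2 k_1}^2 \\
& \qquad - \tfrac12 \sum_k a_k^2 + \tfrac12 \sum_k b_k^2 + \tfrac12 \sum_k \tilde{b}_k^2 \Big]
\end{aligned} \tag{54}
$$

which has $O(N^{3/2})$ neurons and connections.

In Appendix C we show how to extend this result to a series of successively cheaper approximations of the original sorting network, down to $O(N \log N)$ neurons and connections.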


3.5.  Sorting:  Experiments 

The $O(N^{3/2})$ sorting network only sorts in an approximate way. The reversed neurons work correctly at finite $r$, which is nontrivial since they are connected to each other, but the encoding scheme is prone to trapping by local minima. We used the following parameter values for the $O(N^{3/2})$ sorting network:

$N = 16$, $c_1 = 0.6$, $c_2 = 6.0$, $c_3 = 0$,
$g_0(A) = 20$, $r_A = 1$, $r_\sigma = 3$, $\Delta t = .01$,
sweeps $= 20\,000$

$N = 25$, $c_1 = .44$, $c_2 = 6.0$, $c_3 = 0$,
$g_0(A) = 20$, $r_A = 1$, $r_\sigma = 3$, $\Delta t = .01$,
sweeps $= 20\,000$

and for the $O(N^2)$ network:

$N = 16$, $c_1 = 0.6$, $c_2 = 6.0$, $c_3 = 0$,
$g_0(A) = 20$, $\Delta t = .01$, sweeps $= 5\,000$

$N = 25$, $c_1 = .44$, $c_2 = 6.0$, $c_3 = 0$,
$g_0(A) = 20$, $\Delta t = .01$, sweeps $= 5\,000$

As in the graph-matching example, most of the parameter-tuning was concentrated on $r_\sigma$, $\Delta t$, and sweeps. Here, $r_\sigma$ is the rate parameter appearing in the update equation (7), and it applies to neurons $a$, $b$, $\tilde{b}$, $\sigma$, $\tilde{\sigma}$, $\tau$, $\tilde{\tau}$, $\omega$, $\tilde{\omega}$. Likewise $r_A$ applies to the update equations for $A$ and $\tilde{A}$. As in eqn (48), $c_1$ multiplies the strength of the permuted inner product of $x$ and $y$ in the objective. Also $c_2$ is the strength of the syntax constraints, and $c_3$ is the strength of a term penalizing intermediate neuron values. $g_0(A)$ is the gain $g'(0)$ of the transfer function $g(A)$ for $A$ and $\tilde{A}$, which obeyed steepest-descent dynamics. The constant $y$ values are $y_1 = -(N - 1)/2$, $y_2 = -(N - 3)/2$, ..., $y_{N-1} = (N - 3)/2$, $y_N = (N - 1)/2$. The parameter sweeps is the number of iterations of the forward Euler method used in simulating the continuous update equations.

FIGURE 5. Histogram of placement errors. Sorting network with butterfly encoding, $O(N^{3/2})$ connections. (a) Size $N = 16$ (panel title: "Mistakes in 16 Element Sort"). Average and standard deviation (upper or lower half of an error bar) for 62 runs. (b) Size $N = 25$ (panel title: "Mistakes in 25 Element Sort"). Average and standard deviation for 39 runs.

For input size $N = 16$ we find an average placement error of 1.4 out of 16 possible output places. Eight would be random. For $N = 25$, which is the first size (with integral $\sqrt{N}$) for which there are fewer neurons in the asymptotically smaller network, the average placement error is 1.7 out of 25 places. The errors can be characterized by a histogram showing the frequency with which placement errors of different sizes occur. (The size of a placement error is the difference, in the permuted output vector, between the desired and actual positions of an element.) Histograms for $N = 16$ and $N = 25$ are presented in Figure 5, and they show that small mistakes are far more likely than large ones, and that the frequency falls off as roughly the $-2.1$ (respectively, $-2.0$) power of the size of the placement error, with correlation $r = .90$ ($.93$). In addition, 17.7% (respectively, 21%) of the experimental runs failed to meet our convergence criterion, which was that exactly one element in each row and column of the computed matrix $M$ must have a value greater than .5.
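The convergence criterion and the placement-error measure described above can be sketched as follows. The function names, the row/column convention (row $i$ is the input, column $j$ the output place), and the small sample matrix are assumptions for illustration; the sample is not data from the experiments.

```python
import numpy as np

def converged(M, threshold=0.5):
    """True if exactly one entry in every row and column exceeds the threshold."""
    big = np.asarray(M) > threshold
    return bool(np.all(big.sum(axis=0) == 1) and np.all(big.sum(axis=1) == 1))

def placement_errors(M, x):
    """|actual - desired| output place for each input, assuming distinct x values.

    Row i of M is read as placing x[i] at the column with the largest entry.
    """
    M, x = np.asarray(M), np.asarray(x)
    actual = np.argmax(M, axis=1)
    desired = np.argsort(np.argsort(x))   # rank of each input
    return np.abs(actual - desired)

x = [0.3, -1.2, 2.5, 0.9]
M = np.zeros((4, 4))
M[0, 1], M[1, 0], M[2, 2], M[3, 3] = 0.9, 0.8, 0.7, 0.95   # last two inputs swapped
errors = placement_errors(M, x)
```

A histogram of such errors over many runs is the kind of summary that Figure 5 reports.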

Iterating the sort can improve the score marginally, but not to the perfect sorts achievable with the $O(N^2)$ network at these sizes.

4.  DISCUSSION  AND  CONCLUSION 

Although much research now focuses on expressing new computational problems using objective functions and then deriving neural networks which solve them, we would like to suggest that when such efforts are successful there may be an additional advantage to be obtained. If the solution can be regarded as a novel algebraic transformation of an unremarkable objective, then the transformation may also be immediately applicable to other objectives and problems. This provides a kind of reuse of neural net design effort which is fundamentally more flexible than the reuse of modules or components as practiced in electronic design (Heinbuch, 1988), and in neural net learning (e.g., Mjolsness et al., 1989b). The transformational approach has also proved useful in VLSI design [e.g., transforming a program into a microprocessor (Martin, Burns, Lee, Borkovic & Hazewindus, 1989)].

We have shown that there are algebraic transformations of objective functions which not only preserve the set of fixpoints of the resulting analog neural network, but alter the number and nature of interactions requiring physical connections between different neurons. Some of these transformations require the network to find a saddle point rather than a minimum of the objective, but that can be arranged. Others provide control over the dynamics by which an objective is extremized. A set of such transformations was derived, together with their conditions of validity.

Several design examples were given, along with experimental results to show convergence with reasonable parameter values. Reduced-cost designs were presented for convolution and linear coordinate transformation. A reduced network for sorting converged to approximately correct answers despite the use of an illegitimate transformation which introduced spurious fixpoints. This design also involved legitimate but repeated, interacting product-reduction transformations. Transformed networks for quadratic matching and for registration of random dot patterns were simulated without difficulty.

APPENDIX A. REDUCING YF(X)

The problem is to use eqn (21) to reduce expressions of the form $YF(X)$. We will derive two versions: the first, from stopping after the first step in the derivation of (21); the second, from carrying the transformation all the way through.

Now

$$
\begin{aligned}
Y \int^{X} du\, f(u) &= \int^{X} du \Big[ \int^{Y} dv\, g(u, v) \Big]^{-1}(u), \\
Y f(X) &= \Big[ \int^{Y} dv\, g(u, v) \Big]^{-1}(X), \\
\int^{Y} dv\, g(X, v) &= f^{-1}(X / Y), \\
g(u, v) &= (\partial / \partial v)\, f^{-1}(u / v) \\
&= -(u / v^2)\, (f^{-1})'(u / v).
\end{aligned} \tag{55}
$$

To use the first step in the derivation of eqn (21), we must evaluate

$$\int^{\sigma} du \int^{Y} dv\, g(u, v) = \int^{\sigma} du\, f^{-1}(u / Y) = Y \int^{\sigma / Y} du\, f^{-1}(u)$$

so

$$Y \int^{X} f(u)\, du \longrightarrow XY\sigma - Y \int^{\sigma} du\, f^{-1}(u) \tag{56}$$

since $\sigma$ may be rescaled by $Y$. This shows that both sides of the transformation (16) may be multiplied by any expression $Y$.

To carry through transformation (21) we must calculate

$$
\begin{aligned}
\int^{\tau} dv \Big[ \int^{\sigma} du\, g(u, v) \Big]^{-1}(v)
&= \int^{\tau} dv \Big[ \int^{\sigma} du\, (-u / v^2)\, (f^{-1})'(u / v) \Big]^{-1}(v) \\
&= \int^{\tau} dv \Big[ -\int^{\sigma / v} du\, u\, (f^{-1})'(u) \Big]^{-1}(v) \\
&= \int^{\tau} dv \Big[ -\int^{f^{-1}(\sigma / v)} dz\, f(z) \Big]^{-1}(v) \\
&= \int^{\tau} dv \big[ -F(f^{-1}(\sigma / v)) \big]^{-1}(v) \\
&= \int^{\tau} dv \big[ \sigma / f(F^{-1}(-v)) \big] = -\sigma \int^{F^{-1}(-\tau)} dw \\
&= -\sigma F^{-1}(-\tau)
\end{aligned} \tag{57}
$$

from which we deduce that

$$YF(X) \longrightarrow -X\sigma + Y\tau + \sigma F^{-1}(\tau). \tag{58}$$

($\tau$ and $\sigma$ have been rescaled by $-1$.) The algebra can be checked by optimizing with respect to $\sigma$. Note that the derivations of (56) and (58) assume that $f = F'$ is invertible and that $f^{-1} = (F')^{-1}$ is differentiable (i.e., $((F')^{-1})'$ exists).

APPENDIX B. BUTTERFLY NETWORKS

B.1. Back-to-back Butterflies

By a recursive induction argument (Benes, 1965), any permutation matrix of size $N = 2^n$ ($n$ an integer) can be expressed by setting switches in two back-to-back butterfly networks independently, as shown in Figure 6. If we label the corresponding connection matrices $A$ and $\tilde{A}$, then the entire network represents a permutation matrix $M$ in the form of a matrix product $M_{ij} = (A\tilde{A}^T)_{ij} = \sum_k A_{ik} \tilde{A}_{jk}$, with $A$ and $\tilde{A}$ being of the special "butterfly" form. Butterfly networks are best analyzed by introducing binary notation for all indices, for example,

$$
\begin{aligned}
i &\longrightarrow (p_1, \ldots, p_n) \equiv p_1 \ldots p_n \\
j &\longrightarrow (q_1, \ldots, q_n) \equiv q_1 \ldots q_n
\end{aligned} \tag{59}
$$

In this notation, the outer column of switches in the $A$ butterfly has the form

$$B^{(1)}_{p_1 \ldots p_n, q_1} \in [0, 1] \tag{60}$$

with constraints


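$$\sum_{p_1} B^{(1)}_{p_1 \ldots p_n, q_1} = 1 = \sum_{q_1} B^{(1)}_{p_1 \ldots p_n, q_1} \tag{61}$$

as illustrated by the butterfly ($\bowtie$) highlighted in Figure 6. Likewise the $l$th column of switches has the form

$$B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \in [0, 1] \quad \text{with} \quad \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} = 1 = \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \tag{62}$$

for $1 \le l \le n$. With constraints, each of the $\log N$ layers contains $N/2$ bits, which is one reason that $\tilde{A}$ is also needed to specify an entire permutation matrix.

The entire permutation matrix is obtained by finding all the possible paths through the network from $i$ to $j$, of which there is only one since each stage irrevocably decides one bit of $j$. This is equivalent to a conjunction of switch settings:

$$A_{ij} = \prod_{l=1}^{n} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l}. \tag{63}$$

It is easy to check that this is a permutation matrix, using the constraints on $B^{(l)}$. (In what follows the terms of a product $\prod$ do not commute because they contain summations $\sum$ that extend over subsequent terms in the product.)

$$
\begin{aligned}
\sum_j A_{ij} &= \sum_{q_1 \ldots q_n} \prod_{l=1}^{n} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \\
&= \prod_{l=1}^{n} \Big( \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= \Big[ \prod_{l=1}^{n-1} \Big( \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \Big] \Big( \sum_{q_n} B^{(n)}_{p_n, q_1 \ldots q_n} = 1 \Big) \\
&= \prod_{l=1}^{n-1} \Big( \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= 1, \text{ by induction on } n.
\end{aligned} \tag{64}
$$

Likewise

$$
\begin{aligned}
\sum_i A_{ij} &= \prod_{l=n}^{1} \Big( \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= \Big[ \prod_{l=n}^{2} \Big( \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \Big] \Big( \sum_{p_1} B^{(1)}_{p_1 \ldots p_n, q_1} = 1 \Big) \\
&= \prod_{l=n}^{2} \Big( \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= 1, \text{ by induction.}
\end{aligned} \tag{65}
$$

FIGURE 6. Butterfly switching networks. Any permutation of $N = 2^n$ elements can be represented by appropriately setting the switches in a pair of back-to-back butterfly switching networks, as can be shown recursively by induction on $n$. Highlighted: one $2 \times 2$ permutation matrix or "butterfly", and one path through the entire switching network.

B.3. Coarse Butterflies

A less restrictive form for $A$ may be derived as follows. Let $k$ be roughly $n/2$. Then

$$
\begin{aligned}
i_1 &\equiv p_1 \ldots p_k, & i_2 &\equiv p_{k+1} \ldots p_n, \\
j_1 &\equiv q_1 \ldots q_k, & j_2 &\equiv q_{k+1} \ldots q_n,
\end{aligned}
$$

and from eqn (63),

$$
\begin{aligned}
A_{ij} &= \prod_{l=1}^{n} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \\
&= \Big( \prod_{l=1}^{k} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \Big( \prod_{l=k+1}^{n} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&\equiv A^{(1)}_{i_1 i_2, j_1} A^{(2)}_{i_2, j_1 j_2}
\end{aligned} \tag{66}
$$

and as in eqn (64) one can derive the constraints

$$
\begin{aligned}
\sum_{j_1} A^{(1)}_{i_1 i_2, j_1} &= \prod_{l=1}^{k} \Big( \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= \Big[ \prod_{l=1}^{k-1} \Big( \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \Big] \Big( \sum_{q_k} B^{(k)}_{p_k \ldots p_n, q_1 \ldots q_k} = 1 \Big) \\
&= \prod_{l=1}^{k-1} \Big( \sum_{q_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= 1, \text{ by induction on } k.
\end{aligned} \tag{67}
$$

Likewise

$$
\begin{aligned}
\sum_{i_1} A^{(1)}_{i_1 i_2, j_1} &= \prod_{l=k}^{1} \Big( \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= \Big[ \prod_{l=k}^{2} \Big( \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \Big] \Big( \sum_{p_1} B^{(1)}_{p_1 \ldots p_n, q_1} = 1 \Big) \\
&= \prod_{l=k}^{2} \Big( \sum_{p_l} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \Big) \\
&= 1, \text{ by induction.}
\end{aligned} \tag{68}
$$

The constraints on $A^{(2)}$ and $\tilde{A}$ are similar. Thus the constraints on $B^{(l)}$ imply the less restrictive constraints on $A^{(1)}$ and $A^{(2)}$, which we then adopt as the only constraints operating on $A$.

This completes the proof that any permutation matrix $M$ can be represented by $O(N^{3/2})$ variables in the form of eqn (49) with constraints as in eqn (50).

APPENDIX C. FULL BUTTERFLY NEURAL NETS

The arguments of Appendix B can be generalized to yield a series of reduced objectives interpolating between $O(N^{3/2})$ and $O(N \log N)$ variables, each having about the same number of connections as variables. In Appendix B the idea was to express each index $i$ in base $\sqrt{N} = 2^{n/2}$, by dividing the indices $p_1 \ldots p_n$ of its binary expansion in two groups. We may instead divide the $n$ binary indices into $m$ groups of size $k$, with $n = km$, deriving the base $2^k$ expansion $i = i_1 \ldots i_m$. (We have dealt with the special cases $k = n$ and $k = n/2$.) Then

$$
\begin{aligned}
A_{ij} &= \prod_{l=1}^{n} B^{(l)}_{p_l \ldots p_n, q_1 \ldots q_l} \\
&= \prod_{l=1}^{m} \Big( \prod_{l'=(l-1)k+1}^{lk} B^{(l')}_{p_{l'} \ldots p_n, q_1 \ldots q_{l'}} \Big) \\
&\equiv \prod_{l=1}^{m} A^{(l)}_{i_l \ldots i_m, j_1 \ldots j_l}
\end{aligned} \tag{69}
$$

as in eqn (66). The constraints on $A^{(l)}$ are, as usual,

$$\sum_{i_l} A^{(l)}_{i_l \ldots i_m, j_1 \ldots j_l} = 1 = \sum_{j_l} A^{(l)}_{i_l \ldots i_m, j_1 \ldots j_l} \tag{70}$$

which are generally less restrictive than the original constraints on the butterfly switches $B$.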
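The single-butterfly construction of Appendix B (a product of switch settings along the unique path from $i$ to $j$) can be illustrated with binary switches. In this minimal sketch each switch is either straight (0) or crossed (1); the function names, the callback signature `switch(l, rest_p, prev_q)`, and the random settings are assumptions made for illustration.

```python
import itertools
import random
import numpy as np

def butterfly_matrix(n, switch):
    """Connection matrix A of one butterfly network on N = 2**n lines.

    switch(l, rest_p, prev_q) returns 0 or 1 for the stage-l switch selected by
    the not-yet-processed input bits rest_p and the already decided output bits
    prev_q. Stage l then fixes output bit q_l = p_l XOR switch value.
    """
    N = 2 ** n
    A = np.zeros((N, N), dtype=int)
    for p in itertools.product((0, 1), repeat=n):
        q = []
        for l in range(n):
            s = switch(l, p[l + 1:], tuple(q))
            q.append(p[l] ^ s)          # each stage irrevocably decides one bit of j
        i = int("".join(map(str, p)), 2)
        j = int("".join(map(str, q)), 2)
        A[i, j] = 1
    return A

def random_switches(seed=0):
    """Random but consistent 0/1 settings, generated lazily per switch."""
    rng = random.Random(seed)
    table = {}
    def switch(l, rest_p, prev_q):
        key = (l, rest_p, prev_q)
        if key not in table:
            table[key] = rng.randint(0, 1)
        return table[key]
    return switch

A = butterfly_matrix(3, random_switches(seed=1))
```

Whatever the settings, the result has exactly one 1 in each row and column. A single butterfly cannot reach every permutation, which is why the text composes two of them back to back.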

From eqn (52), our problem is to reduce

$$E_1(A) = -c_1 \sum_{ij} A_{ij} x_i e_j \tag{71}$$

(where $e_j$ is any expression) substituting eqn (69).

If  we  rake  e,  =  C-  =  CL.,-,-  and  £'^)  ~  E'  (C  >' 
then  we  may  use  induction  on  a  to  reduce 

ET(C<*')  =  -c,  2  (a  a  2)' 


-  n 


C(-)  /-(•-»  4<-' 


e?  -  -I  2  f(  2 

*  •,r‘M  L 


+  2 


-  (  2 
x«i — »*— 1 


-  ^2  J 

-c,  f  2  2 

L.| — >n 

X  (*££*-*-.  - 

+  22 

‘—‘m  i\~it 

X  i-./V-J.- 1  ” 

+  j  2  2 


Note  that,  upon  identifying 

0.(«*n  _  j-u-D  = 

we have an induction step $\alpha \to \alpha - 1$, $\forall \alpha \ge 2$, which decreases the number of neurons and connections in the network. By induction on $\alpha$, one may reduce eqn (52) to:


£«*.  —  ~Ci  ["  2  ^•i--‘-jijc'i-‘. 

Li|— *arJ| 

x  ~  ^,m.n) 

+  22 


V,  _  .<*>  \ 

+  2  ^wi-i- 

x  (<rlc;'.L-,  -  «c:,!wu JK-v-  -  VJ 

+  )  2  2  j* 

+  (tJ.W  + 


x - ►  y  a - ►  a 

B - *  B  r  - *  t 

+  same,  with  - „  h  m - ,  £ 


+  i  2  l-aM—  +  bfi+-  +  H-j]-  (7S) 

The number of variables and connections for this is $O(m N^{1 + 1/m})$, $1 \le m \le n$, which takes on values $N^2$, $2N^{3/2}$, $3N^{4/3}$, ..., $N \log N$.
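The sequence of sizes quoted above can be tabulated with a short sketch. It counts the entries of the $m$ factor matrices for a single butterfly $A$ with $N = 2^n$ and groups of $k = n/m$ bits, on the assumption that each factor carries $(m + 1)k$ index bits; the function name is hypothetical, and constraints and interneurons are not subtracted or added.

```python
def butterfly_variable_count(n, m):
    """m * N**(1 + 1/m) for N = 2**n, computed exactly as m * 2**((m + 1) * k)."""
    if n % m:
        raise ValueError("m must divide n")
    k = n // m                      # bits per index group
    return m * 2 ** ((m + 1) * k)   # m factors, each with (m + 1) * k index bits

n = 6                               # N = 64
counts = {m: butterfly_variable_count(n, m) for m in (1, 2, 3, 6)}
```

For $N = 64$ this gives $N^2 = 4096$ at $m = 1$, $2N^{3/2} = 1024$ at $m = 2$, $3N^{4/3} = 768$ at $m = 3$, and $768$ again at $m = n$, where the formula evaluates to $2N \log_2 N$.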


C  Appendix:  Bayesian  Inference  on  Visual  Grammars  by  Neural 
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